This is where you had lost me before, or maybe I should say the documentation did, because if it's actually 8 per cycle per FPU then they should say that.
Oh, I'm sorry. I can see how my answer was less than helpful.
This is because in x86 nomenclature, a single SSE "bundle" is a single op working on a 128-bit quantity. The fact that some ops treat their arguments as 4 distinct floats is just a detail about the op. So when you posted the quote, I frankly did not understand what you were talking about.
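To make the "one op, one 128-bit quantity" point concrete, here is a rough Python model of a packed single-precision add. The names pack_ps and addps are made up for illustration; this only mimics the semantics and says nothing about how the hardware executes it.

```python
import struct

def pack_ps(a, b, c, d):
    """Pack four single-precision floats into one 16-byte (128-bit) value."""
    return struct.pack("<4f", a, b, c, d)

def addps(x, y):
    """Model of a packed add: ONE op taking two 128-bit operands.

    Treating each operand as four float lanes is a detail of this
    particular op, not of the 128-bit quantity itself.
    """
    lanes_x = struct.unpack("<4f", x)
    lanes_y = struct.unpack("<4f", y)
    return struct.pack("<4f", *(p + q for p, q in zip(lanes_x, lanes_y)))

x = pack_ps(1.0, 2.0, 3.0, 4.0)
y = pack_ps(0.5, 0.5, 0.5, 0.5)
result = addps(x, y)

print(len(result) * 8)               # width of the result in bits
print(struct.unpack("<4f", result))  # the same bytes viewed as four floats
```

So counting "ops" gives 1 where counting "floats" gives 4, which is where the two of us were talking past each other.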
So you are saying that both 128-bit threads are run per cycle from the partitioned 256-bit load? If so, then this would explain our difference. Though again the documentation is not clear:
"Only 1 256-bit operation can issue per cycle, however an extra cycle can be incurred as in the case of a FastPath Double if both micro ops cannot issue together."
Any 256-bit ops are cracked in the decoder before issue into two separate 128-bit ops, which then issue to the 128-bit units whenever they can.
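A back-of-the-envelope sketch of that arithmetic, in Python. It assumes two 128-bit FMA pipes in the shared FPU and 32-bit single-precision elements, and it reads the "8 per cycle" figure above as 8 single-precision FMAs per cycle; crack and fma_lanes_per_cycle are hypothetical helper names, not anything from AMD's documentation.

```python
def crack(op_bits, unit_bits=128):
    """Split one wide op into unit-width micro-ops (a model of the decoder step)."""
    return [unit_bits] * (op_bits // unit_bits)

def fma_lanes_per_cycle(pipes=2, unit_bits=128, elem_bits=32):
    """Float lanes processed per cycle if every pipe issues one op (assumed values)."""
    return pipes * (unit_bits // elem_bits)

# One 256-bit op becomes two 128-bit micro-ops.
print(crack(256))

# Two 128-bit pipes, four floats each.
print(fma_lanes_per_cycle())

# One cracked 256-bit op per cycle covers the same number of lanes
# as two independent 128-bit ops per cycle.
lanes_from_256 = sum(bits // 32 for bits in crack(256))
lanes_from_128 = 2 * (128 // 32)
print(lanes_from_256 == lanes_from_128)
```

Under those assumptions the peak comes out the same either way: one cracked 256-bit op or two 128-bit ops per cycle both fill the two pipes.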
The front end is 4 instructions per clock, alternating between threads.
AVX-256 is not necessary to provide enough instructions to hit peak FP throughput. According to Agner Fog, it is counterproductive.
Agner Fog's guide for BD is out? How did I miss that?
You are, of course, correct. I was referring to the theoretical case where you want to use both FMA pipes, store, and loads simultaneously. Turns out you can't actually do that anyway -- according to Fog, BD can do either 2 loads or load+store, not 2 loads + store like SNB. (SNB has just 2 AGUs, but when loading 256-bit quantities, or doing unaliased loads, it can use one AGU to drive both load pipes.)
This is somewhat mystifying to me. K8 used a similar scheme to crack long FPU ops in half, and there you could decode as many double instructions per clock as you felt like. I frankly just assumed that BD was the same way.
Chalk that latter nugget up on the list of unexpectedly weak things about Bulldozer.
The list is getting rather long.
Still, this is relative to Xenon, which is a bar BD should be able to clear.
Of course. AMD might be playing b series compared to Intel, but Xenon (and most other options) are enrolled in the special olympics. What store-to-load hazards? It's not like anyone would ever like to read data near data they've just stored. I mean, the C stack is just so passé.