Ugh, I wrote a long reply a while ago and apparently it disappeared somewhere.
It looks like AVX256 is indeed split into two macro-ops and that there's a bottleneck that allows only two macro-ops to be dispatched per cycle. Because of this I could see AVX256 usage being detrimental vs 128-bit instructions, but AVX128 could still be advantageous.
Especially given how ridiculously overprovisioned the fetch and scan are: 32B per clock, and it can scan instructions of up to the worst-case 17 bytes... that's better than any Intel frontend ever. Using longer instructions (like 3-op AVX) instead of a larger number of shorter ones should basically always be a win. Long immediates should never hurt either.
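To make the 3-op point concrete, here's a minimal C-intrinsics sketch (function name and the Intel-syntax asm in the comments are mine, just for illustration): with VEX encoding the compiler can keep both inputs live in one longer instruction, where legacy SSE needs an extra register copy first.

```c
#include <immintrin.h>

/* Sketch: with -mavx the compiler can emit the 3-operand VEX form, e.g.
 *     vaddps xmm2, xmm0, xmm1     ; one longer instruction, both inputs preserved
 * while legacy SSE needs a copy first if x must stay live:
 *     movaps xmm2, xmm0
 *     addps  xmm2, xmm1           ; two shorter instructions
 * With 32B/clock of fetch, the single longer encoding is the cheaper choice. */
static inline __m128 add_keep_inputs(__m128 x, __m128 y)
{
    return _mm_add_ps(x, y);   /* x and y both remain usable afterwards */
}
```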
Another thing I noticed is that the L2 latency is pretty damn bad... no less than 25 cycles load-to-use (I guess that includes the L1 miss cycles as well)... so at best no better than Bobcat, maybe slightly worse?
Yep, although I think the L2 can sustain more misses at a time, and there is more BW.
I'm not sure two macro-ops per cycle retirement is much of an issue; the whole design pretty much looks like it's optimized for two of them per cycle. The requirement that fast-path double macro-ops retire "simultaneously" shouldn't hurt much either, I think (after all, they have the same dependency and latency, so they should be able to execute right after each other). But yeah, given that the only advantage of AVX-256 here is lower instruction count / smaller code in the first place, there's probably not much point to AVX-256.
What I was interested in was whether it would be possible to do all your housekeeping ops for free without having to unroll much. It seems that getting all the throughput out of the FPU takes all of the frontend and retire BW. At least we get free load ops per instruction.
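Rough sketch of what I mean by the free loads (names are just illustrative): when one source comes from memory, the load rides along in the same macro-op as the arithmetic, so it doesn't eat extra frontend/retire bandwidth.

```c
#include <immintrin.h>

/* A compiler will normally fold one _mm_load_ps into the multiply as a memory
 * operand (mulps xmm, [mem]), so that load costs no extra decode/retire slot;
 * the second load still needs its own instruction, since an op can have only
 * one memory operand. */
static inline __m128 dot_step(__m128 acc, const float *a, const float *b)
{
    __m128 va = _mm_load_ps(a);                      /* usually folded into the mul */
    __m128 vb = _mm_load_ps(b);                      /* stays a separate load */
    return _mm_add_ps(acc, _mm_mul_ps(va, vb));
}
```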
Some instructions also got notably faster. OK, we knew about the integer divider already (roughly twice as fast as previous int-domain divides), but omg, 2 popcnt/lzcnt per clock with 1-cycle latency?
I'm pretty hyped about that one. popcnt can be used to implement a fast, space-efficient, immutable hash trie with structural sharing, which is a great complex data structure for multithreaded work.
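For the curious, this is the popcnt trick in a hash array mapped trie (HAMT); the names below are just illustrative, not from any particular library.

```c
#include <stdint.h>

/* Each HAMT node keeps a 32-bit occupancy bitmap plus a dense array holding
 * only the children that actually exist.  popcnt over the bits below our slot
 * converts the sparse 5-bit hash chunk into the dense array index, so lookup
 * costs one popcnt and one indexed load per level. */
static inline int hamt_has_child(uint32_t bitmap, unsigned slot)   /* slot = 0..31 */
{
    return (bitmap >> slot) & 1u;
}

static inline int hamt_child_index(uint32_t bitmap, unsigned slot)
{
    uint32_t below = bitmap & ((1u << slot) - 1u);    /* occupied slots before ours */
    return __builtin_popcount(below);                 /* compiles to popcnt with -mpopcnt */
}
```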
palignr ... now executes at 2 per clock
Is it used for anything other than avoiding cacheline-crossing load penalties on Intel CPUs? Boundary-crossing unaligned loads have always been comparatively very cheap on AMD, so I don't think there's that much use for it anyway.
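For reference, this is the Intel-era trick being referred to, as a rough sketch (SHIFT stands for the byte misalignment and must be a compile-time constant):

```c
#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 / palignr */

#define SHIFT 4          /* byte misalignment; must be a compile-time constant */

/* Two aligned loads and a palignr splice out the misaligned 16-byte window,
 * instead of one unaligned load that may cross a cache line. */
static inline __m128i load_misaligned(const __m128i *aligned_base)
{
    __m128i lo = _mm_load_si128(aligned_base);        /* bytes 0..15  */
    __m128i hi = _mm_load_si128(aligned_base + 1);    /* bytes 16..31 */
    return _mm_alignr_epi8(hi, lo, SHIFT);            /* bytes SHIFT..SHIFT+15 */
}
```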
It's also got _very_ good horizontal operations like haddps (worlds better than Bulldozer, and in fact better than even Ivy Bridge).
Yeah, what is it, 16 horizontal sums of PS in 12 ops over 15 clocks? With half the loads baked in? Leaving half the frontend and FPU free for other work? No reason ever to not use horizontal ops.
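For anyone counting along, the 12-op pattern looks roughly like this sketch with SSE3 intrinsics; the second source of each first-level haddps can come straight from memory, which is where the baked-in loads come from.

```c
#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps / haddps */

/* 16 horizontal sums of 4-float vectors in 12 haddps: 8 at the first level,
 * 4 at the second.  The second source of each first-level haddps can be a
 * memory operand, so half of the 16 input loads fold into the haddps ops. */
static inline void hsum16(const __m128 v[16], __m128 out[4])
{
    __m128 t[8];
    for (int i = 0; i < 8; i++)                        /* level 1: 8 ops */
        t[i] = _mm_hadd_ps(v[2 * i], v[2 * i + 1]);
    for (int j = 0; j < 4; j++)                        /* level 2: 4 ops */
        out[j] = _mm_hadd_ps(t[2 * j], t[2 * j + 1]);
    /* out[j] = { hsum(v[4j]), hsum(v[4j+1]), hsum(v[4j+2]), hsum(v[4j+3]) } */
}
```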
Incidentally, this more or less proves that the FPU is new and not a derivative of any of AMD's previous ones. Their other FPUs implement 128-bit as two physically separate lanes of 80-bit and 64-bit. No way can you do horizontals this fast unless the lanes are close together.
even DPPS, which is quite a mess on Bulldozer, looks quite decent
If I understand the "no local forwarding" hit right, it's exactly the same cost in latency as if you do it by hand. However, it issues 2 more macro-ops, and that probably completely occupies the frontend for 3 clocks, meaning the handcrafted version is probably better unless you want the masking/result placement properties.
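Side by side, the two variants I mean look roughly like this (sketch; the 0xF1 immediate is just one choice of dpps masking, meaning "all four lanes in, result in lane 0 only"):

```c
#include <smmintrin.h>   /* SSE4.1: _mm_dp_ps / dpps (also pulls in SSE3 haddps) */

static inline float dot4_dpps(__m128 a, __m128 b)
{
    /* dpps with immediate 0xF1: multiply all four lanes, sum, write lane 0 only */
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}

static inline float dot4_by_hand(__m128 a, __m128 b)
{
    /* mulps + two haddps: about the same latency, fewer macro-ops issued */
    __m128 p = _mm_mul_ps(a, b);
    p = _mm_hadd_ps(p, p);       /* { p0+p1, p2+p3, p0+p1, p2+p3 } */
    p = _mm_hadd_ps(p, p);       /* full sum in every lane */
    return _mm_cvtss_f32(p);
}
```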
And it keeps the incredible 2-cycle SIMD mul latency... still wondering how AMD does this when they only manage 3 cycles for an add.
Isn't FP single multiply an easier operation than FP single add? 24-bit mul + round, versus exponent match, add, round.