Analysis of the AVX-512 implementation.
tl;dr:
It's overall very good.
Peak FMA throughput and peak memory throughput are half those of Intel's full-fat AVX-512 implementations, but despite this it will beat them on many workloads. Latencies are low, and shuffles and mask arithmetic in particular are substantially better than on Intel. Unless you are doing BLAS or something similarly bound by raw FMA throughput, these "ancillary" operations can make up for the deficit in raw computational throughput. It really helps that the core doesn't downclock for AVX-512 -- in fact, clocks are higher for 512-bit code than for 256-bit AVX code (because the frontend has half the work to do to keep the backend occupied).
On the negative side, VGATHER* instructions are still comically slow, and unlike Intel, the core is register-port limited when doing FMAs with masked result merging.
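For readers unfamiliar with the term, here is a minimal scalar sketch of what "masked result merging" means for an FMA, next to the zeroing variant. It is modeled on the documented per-lane semantics of the `_mm512_mask_fmadd_ps` and `_mm512_maskz_fmadd_ps` intrinsics. The function names `model_mask_fmadd` and `model_maskz_fmadd` are hypothetical, made up for this illustration. The code only shows which value each lane ends up with; it says nothing about register ports or performance, and it does not model the single rounding of a real fused multiply-add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 16 }; /* 16 x 32-bit floats in one 512-bit register */

/* Merge-masking model (hypothetical helper): lanes whose mask bit is
   clear keep the value of the first operand `a`, as with
   _mm512_mask_fmadd_ps(a, k, b, c). */
void model_mask_fmadd(float out[LANES], const float a[LANES], uint16_t k,
                      const float b[LANES], const float c[LANES])
{
    assert(out && a && b && c);
    for (size_t i = 0; i < LANES; i++) {
        int active = (k >> i) & 1u;
        /* Plain multiply-add: rounding of a true FMA is not modeled. */
        out[i] = active ? a[i] * b[i] + c[i] : a[i];
    }
}

/* Zero-masking model (hypothetical helper): lanes whose mask bit is
   clear are set to zero, as with _mm512_maskz_fmadd_ps(k, a, b, c). */
void model_maskz_fmadd(float out[LANES], uint16_t k, const float a[LANES],
                       const float b[LANES], const float c[LANES])
{
    assert(out && a && b && c);
    for (size_t i = 0; i < LANES; i++) {
        int active = (k >> i) & 1u;
        out[i] = active ? a[i] * b[i] + c[i] : 0.0f;
    }
}
```

With a = 2, b = 3, c = 1 in every lane and mask 0x0001, lane 0 is 7 in both models, while the remaining lanes are 2 (the old value of `a`) under merging and 0 under zeroing.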