That XMX throughput assumes the same capabilities as the Matrix Engines in Ponte Vecchio. When asked, Intel reps would not give more details on Alchemist than what was presented in the slides. That may have been different for the US press, but if so, they should be explicit about it.
It is possible, if unlikely, that for consumer-grade GPUs Intel chose pure inference engines, while Ponte Vecchio's HPC-style XMX engines are half as many, twice as wide, and churn out 2048 ops/clk on TF32 (!), 4096 on FP16 and BF16, and 8192 on INT8. Its Vector Engines have been refactored too (8 × 512-bit vs. 16 × 256-bit), if that's any indication.
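For a rough sanity check, here is a minimal back-of-envelope sketch that reproduces those per-clock figures. The assumptions are mine, not Intel's: 8 XMX engines per Xe core (half of Alchemist's 16), one MAC counting as 2 ops, halving the operand width doubling the MACs per engine per clock, and the 128-MAC TF32 figure picked purely so the totals match the slide numbers.

```python
# Hypothetical per-Xe-core XMX throughput scaling (assumptions, not confirmed specs).
XMX_ENGINES_PER_XE_CORE = 8   # assumed: half as many as Alchemist's 16
TF32_MACS_PER_ENGINE = 128    # assumed so that 8 * 128 * 2 = 2048 ops/clk

def ops_per_clock(macs_per_engine: int, engines: int = XMX_ENGINES_PER_XE_CORE) -> int:
    """Two ops (multiply + add) per MAC, summed over all engines."""
    return engines * macs_per_engine * 2

# Halving operand width is assumed to double MACs per engine per clock.
for fmt, macs in [("TF32", TF32_MACS_PER_ENGINE),
                  ("FP16/BF16", TF32_MACS_PER_ENGINE * 2),
                  ("INT8", TF32_MACS_PER_ENGINE * 4)]:
    print(f"{fmt:>10}: {ops_per_clock(macs)} ops/clk")
# Prints 2048, 4096, 8192 -- the doubling pattern from the quoted figures.
```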