How are they getting the same throughput then?
I’m going off AMD’s numbers, but I’m not sure how they’re calculated. GB203 does 1024 FP16 ops per clock per SM, and an N48 CU hits the same number with only 128 FP32 ALUs. That works out to 8 FP16 ops per ALU per clock, and this is without sparsity. Black magic maybe.
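
For anyone who wants to check the arithmetic, here is a rough sketch. The two inputs are just the figures quoted above, not something I measured, and the FMA line assumes the usual convention that one fused multiply-add counts as 2 ops. The function and variable names are made up for illustration.

```python
# Figures quoted in the post (vendor numbers, dense, no sparsity).
FP16_OPS_PER_CLOCK_PER_CU = 1024
FP32_ALUS_PER_CU = 128


def ops_per_alu(ops_per_clock: int, alus: int) -> float:
    """FP16 ops each ALU would have to retire per clock."""
    return ops_per_clock / alus


per_alu = ops_per_alu(FP16_OPS_PER_CLOCK_PER_CU, FP32_ALUS_PER_CU)

# Assumption: an FMA is counted as 2 ops (one multiply + one add).
fma_per_alu = per_alu / 2

print(f"FP16 ops per ALU per clock:  {per_alu}")
print(f"FP16 FMAs per ALU per clock: {fma_per_alu}")
```

So under that counting convention, the claim is equivalent to 4 FP16 FMAs per FP32 lane per clock. How the hardware actually gets there is not something the quoted numbers tell you.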