Could you clarify what you mean by that? How are those two descriptions different, what is the baseline?
For full disclosure, I have no idea how these things can be implemented in hardware. It was pointed out (
https://www.realworldtech.com/forum/?threadid=197759&curpostid=197993) that fusing two FP16 ALUs to perform a FP32 operation or splitting a single FP32 ALU to perform two FP16 operations per clock are functionally equivalent, something I was not aware of.
What I mean, essentially, is that Apple GPUs are exposed as having a SIMD width of 32 on both the A14 and the M1 (each core is understood to have four SIMD units for a total width of 128). The SIMD width is a constant regardless of whether you are running FP16 or FP32 operations, and it supports all the regular bells and whistles like SIMD lane masking and various kinds of SIMD broadcast and reduction operations. Regardless of how it is implemented (many people here will probably have a much better idea than me), the end result is that the A14 (a mobile chip) runs FP16 at full rate and FP32 at half rate — whether it achieves that by fusing two FP16 ALUs or by using the same FP16 ALU over two cycles or something else is just an implementation detail IMO — while the M1 seems to have wider ALUs capable of doing FP32 FMA at full rate, without increasing its relative FP16 rate. I would further speculate that the M1 may not have any special support for FP16 at all: it may simply do an FP32 computation, widening the FP16 inputs and rounding the output back to FP16.
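To make that last speculation concrete, here is a minimal software sketch (NumPy, obviously not Apple hardware): an FP16 multiply implemented by widening both inputs to FP32, multiplying at full precision, and rounding the result back to FP16. For a single operation this produces the same correctly rounded FP16 result as a native FP16 multiply, which is why hardware without dedicated FP16 units can still be FP16-correct.

```python
import numpy as np

# Hypothetical illustration of the "widen to FP32, round back" scheme:
# promote both FP16 inputs to FP32, multiply at full precision, then
# round the product to the nearest FP16 value.
def fp16_mul_via_fp32(a: np.float16, b: np.float16) -> np.float16:
    return np.float16(np.float32(a) * np.float32(b))

# Compare against NumPy's own float16 multiply over many random values.
rng = np.random.default_rng(0)
xs = rng.standard_normal(1000).astype(np.float16)
ys = rng.standard_normal(1000).astype(np.float16)

widened = np.float16(xs.astype(np.float32) * ys.astype(np.float32))
native = xs * ys  # NumPy's float16 multiply

print(bool(np.array_equal(widened, native)))
```

The two schemes agree bit for bit here; note this is about a single rounded operation, not about fused sequences like FMA, where intermediate precision can matter.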
I think what my results primarily suggest is that Apple does not have separate FP16 and FP32 units like some other implementations do; instead, they reuse (parts of) the same ALUs to perform computations at different precisions.
Can you run, e.g., 3DMark Wild Life for iOS on your M1 system? Or GFXBench 5 for iOS vs. for macOS (not M1-native)?
https://www.notebookcheck.net/Apple...s-Intel-and-AMD.508057.0.html#toc-performance
There are a lot of results available for these benchmarks already. Is there something specific that you are looking for? I don't think it's surprising that M1 Macs outperform the iPad — they have twice as many GPU cores...
If I interpret the Rogue diagrams correctly, they seem to suggest that IMG has separate FP32 and FP16 ALUs that can execute operations in parallel. My results for Apple GPUs suggest that they are much "simpler": I don't believe they have separate FP16 and FP32 ALUs at all, just ALUs that can run FP32, FP16, or integer operations.
I'm trying to understand your numbers, and it looks to me like an M1 GPU core is not the same as an Apple A14 GPU core. I read your numbers as saying that an A14 GPU core has 64 FP32 EUs (and 128 FP16 EUs).
My point is that an M1 GPU core is physically wider, as it can do 128 FP32 FMA operations OR 128 FP16 operations per clock. An A14 core can do 128 FP16 FMA operations per clock OR 64 FP32 FMA operations per clock. As I wrote above, I have no idea how this is actually implemented in hardware. Also, looking at die shots, an M1 GPU core is much larger physically than an A14 core.
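As a sanity check, the per-clock rates above can be turned into peak-throughput figures. A quick sketch in Python — note the ~1.278 GHz M1 GPU clock is an assumption taken from public estimates, not something stated in this thread:

```python
# Peak FP32 throughput from the per-clock FMA rates discussed above.
# One FMA counts as 2 floating-point operations.
# Assumed M1 GPU clock: ~1.278 GHz (public estimate, not an official figure).
def peak_tflops(cores: int, fma_per_core_per_clock: int, clock_ghz: float) -> float:
    return cores * fma_per_core_per_clock * 2 * clock_ghz / 1000.0

# M1: 8 GPU cores, 128 FP32 FMAs per core per clock.
m1_fp32 = peak_tflops(cores=8, fma_per_core_per_clock=128, clock_ghz=1.278)
print(f"M1 FP32 peak: {m1_fp32:.2f} TFLOPS")  # ~2.62
```

That lands right on the commonly cited ~2.6 TFLOPS figure for the M1 GPU, which is consistent with the 128-FP32-FMAs-per-core-per-clock reading.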