That single-core score for the Snapdragon X Elite was obtained under Linux. It's lower under Windows, as shown above.
Doesn't the mobile variant only have 8GB of VRAM? Also, what version of Blender, and what scene? It's silly to expect MetalRT to be as mature as the NVIDIA backend at this point in time.
Apple did ditch Imagination in 2017, following their 2014 "multi-year deal". In early 2020 Apple and Imagination signed another "multi-year deal", but it's not known for how long. It seems inevitable that this latter deal is on borrowed time, so how long this situation continues presumably depends on how confident Apple's IP lawyers are that their current/next GPU tech is sufficiently different in implementation from what they've previously licensed from Imagination.

If they did, that is. They tried to ditch Imagination once, came crawling back, and there hasn't been any news of them abandoning the renewed licensing deal, has there?
Interesting that Apple unifies the (main?) register file, local memory, and data cache to share the same scratch memory pool.

Here is a quick rundown of what changed in the GPU of the M3 series and A17 Pro.
Besides hardware-accelerated ray tracing, hardware-accelerated mesh shading, and Dynamic Caching:
- Dynamic register memory improves occupancy
- Large on-chip cache available to all memory types
- Occupancy adjusts dynamically to memory usage
- FP16, FP32, and integer operations can execute in parallel
That isn’t necessarily a contradiction. You could attain better performance through better occupancy, memory access patterns, and lower power usage enabled by FP16, despite leaving some ALU throughput unused.

What is curious is that on the one hand they say to prefer FP16 wherever possible (to reduce bandwidth/space requirements), but there is no mention of the FP32 pipeline being able to perform FP16 operations, only that "conversion is free". Thus you might get peak throughput only with a 50/50 mix of FP16 and FP32 operations, which would contradict the recommendation.
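For the sake of argument, here is a minimal Metal sketch of what such a 50/50 mix could look like. Everything here (kernel name, loop count, constants) is made up for illustration, and whether the two FMA streams actually dual-issue is exactly the open question:

```
#include <metal_stdlib>
using namespace metal;

// Hypothetical probe: an inner loop with an even mix of FP16 and FP32
// FMAs, all in registers. If the FP32 pipe cannot take FP16 work, a
// balanced mix like this (not pure FP16) is what would saturate both
// pipes at once. The computed values are throwaway.
kernel void mixed_precision_probe(device float *out [[buffer(0)]],
                                  uint tid [[thread_position_in_grid]])
{
    half  h = half(tid & 0xFF);
    float f = float(tid);  // "conversion is free", per the session
    for (int i = 0; i < 1024; ++i) {
        h = fma(h, 0.999h,    0.5h);  // FP16 pipe
        f = fma(f, 0.999999f, 0.5f);  // FP32 pipe, ideally in parallel
    }
    out[tid] = f + float(h);  // keep both results live
}
```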
I can believe that in some cases, but I can absolutely max out the FP16 ALUs on M1/M2 without being limited by occupancy, memory, or power. Given the unification of the register file and threadgroup memory in M3, occupancy should be much less of a problem, and M-series GPUs have far more memory bandwidth per FLOP than, say, Nvidia's tensor-core GPUs. A doubling of ALU throughput would be more than welcome.
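(Back-of-the-envelope, with public specs rather than the parent's numbers: an M2 Max has ~400 GB/s of DRAM bandwidth against ~13.6 FP32 TFLOPS, roughly 0.03 bytes/FLOP, while an A100 has ~2 TB/s against ~312 FP16 tensor TFLOPS, roughly 0.006 bytes/FLOP, about 5x less bandwidth per unit of compute.)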
Peak throughput with FP16 and FP32 is identical on M1, and you can get very close to the theoretical peak with a kernel that does a lot of simdgroup_multiply_accumulate and very little memory access (which isn't particularly useful in practice). For practical use (i.e. a real matmul) you can get ~90% ALU utilisation with FP16 and ~80% with FP32, if I remember correctly. It's been a while since I've worked on Metal kernels, but the difference between FP16 and FP32 on M1 isn't anywhere close to a doubling.

It is also possible that the FP32 ALU pipeline is, e.g., half as wide as the hardware SIMD group, so an FP32 instruction takes twice as many cycles to execute, which would explain FP32 being 50% of peak FP16 throughput on M1 (?).
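For context, a minimal sketch (my reconstruction, not the original kernel) of the kind of ALU probe described above: back-to-back simdgroup_multiply_accumulate on 8x8 half tiles held in registers, with a single store at the end so the loop isn't optimised away:

```
#include <metal_stdlib>
using namespace metal;

// Throughput probe only, not a useful matmul: each iteration performs one
// 8x8x8 half-precision MMA entirely in registers.
kernel void fp16_mma_probe(device half *out [[buffer(0)]],
                           uint sgid [[simdgroup_index_in_threadgroup]],
                           uint tgid [[threadgroup_position_in_grid]])
{
    simdgroup_half8x8 a = make_filled_simdgroup_matrix<half, 8, 8>(1.0h);
    simdgroup_half8x8 b = make_filled_simdgroup_matrix<half, 8, 8>(1.0h);
    simdgroup_half8x8 c = make_filled_simdgroup_matrix<half, 8, 8>(0.0h);

    // 4096 iterations keeps the accumulator finite (8 * 4096 = 32768 < 65504).
    for (int i = 0; i < 4096; ++i) {
        simdgroup_multiply_accumulate(c, a, b, c);  // c = a*b + c
    }

    // One 8x8 tile per simdgroup; the output layout is arbitrary.
    simdgroup_store(c, out + (tgid * 32 + sgid) * 64, 8);
}
```

Swapping half for float (simdgroup_float8x8) in the same skeleton is how you'd compare the two peaks.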