Apple is an existential threat to the PC

If I'm not mistaken, the Snapdragon X Elite has higher performance-core performance than the M3?

Single-core performance is probably going to be quite close. Multi-core performance now looks to be similar to the M3 Max's, judging by the core counts.
 
Even though I promised not to visit this thread again with a 10-foot pole I feel something needs to be said.

That Geekbench 6 single-core score was obtained under Linux, which is a much lighter OS. Under Windows the score suddenly isn't all that impressive when you take the 600 MHz clock speed difference into consideration.

The GPU performance is also a third of the M2 Max's, which has just been replaced by the M3 series.

Apple quotes a 15% performance uplift (consistent with what TSMC has said about their N3 node), which would put the M3 in line with the X Elite's Geekbench 6 score, with a product that is actually available today.



Also why doesn't this Qualcomm business have its own thread?
 
It appears they are clocked the same in the laptops at 4.05GHz.

Geekbench 6 score for the Mac15,9 (MacBook Pro M3 Max) can be found here and scores for the Mac15,3 (MacBook Pro M3) can be found here.

They are all around 2950 to 3150 for single-core scores. The lower scores probably indicate the system is still in the process of Spotlight indexing and will slowly gravitate towards the higher score, just like the A17 Pro scores did.

The M3 Max matches the M2 Ultra in multi-core score at ~21,000, which is pretty nice for a 16-core CPU (versus the M2 Ultra's 24 cores).

The M3 nearly matches both the M1 Max and the M2 Pro (10-core version) in multi-core score, while being significantly faster in single-core performance.

First PugetBench for Photoshop results are also coming in here, with the base M3 hitting an overall score of around 1100.
 
The highest score for the M3 is essentially equivalent to the X Elite's in single-core performance.
That single-core score for the Snapdragon X Elite was obtained under Linux. It's less under Windows as shown above.

Snapdragon X Elite
3239 (Linux)
2980 (Windows)

It appears the scores for the M3 are starting to gravitate towards the higher end now that the new machines are done with background tasks like Spotlight indexing etc.
 
Yeah, the M3 is performing up to expectations: essentially equal in single-core score to the X Elite (a 9-point difference) under the latter's ideal circumstances (benchmarked on Linux with fans at 100%), all while clocked ~300 MHz lower.
 
Solid review from Geekerwan on the M3 and M3 Max, including core-cluster-to-core-cluster latency measurements, SPEC CPU 2017, Blender CPU/GPU, Cinebench 2024 CPU/GPU, Cinebench R23, 7-zip, DaVinci Resolve, Adobe Premiere Pro, Adobe After Effects, Adobe Media Encoder, Xcode, 3DMark, Stable Diffusion and games.


[Attached charts: Blender GPU, Cinebench 2024 GPU, SPEC CPU 2017, Geekbench 6 CPU]
 
From the German reviewer iKnowReview:
[Attached chart: Blender benchmark]

The 3090 is a desktop card and the 4070 is a mobile version. Still, you pay ~3x more for an M3 Max MacBook than for a proper notebook with a 4070...
 
From the German reviewer iKnowReview:
[Attached chart: Blender benchmark]

The 3090 is a desktop card and the 4070 is a mobile version. Still, you pay ~3x more for an M3 Max MacBook than for a proper notebook with a 4070...
Doesn't the mobile variant only have 8 GB of VRAM? Also, what version of Blender? What scene? It's silly to expect MetalRT to be as mature as the NVIDIA backend at this point in time.

It seems like a fine first-generation implementation of hardware ray tracing. This honestly isn't the reason to buy a Mac, but their first foray into it is quite respectable.

The power numbers might also be interesting. No one expects this solution to beat standalone GPUs that can consume more power than the entire SoC.

About your "a proper notebook" comment is that really the level of discussion we want here at B3D?
 
Here is a quick rundown of what changed in the GPU of the M3 series and A17 Pro.

Explore GPU advancements in M3 and A17 Pro.

  • Dynamic register memory improves occupancy
  • Large on-chip cache available to all memory types
  • Occupancy changes dynamically with memory usage
  • FP16, FP32, and integer operations can execute in parallel
That's besides hardware-accelerated ray tracing, hardware-accelerated mesh shading and Dynamic Caching.
 
Now, when they designed the shaders themselves, is Apple completely free of PowerVR IP?
If they did that, that is. They tried to ditch Imagination once, came crawling back, and there hasn't been any news of them abandoning the renewed licensing deal, has there?
 
If they did that, that is. They tried to ditch Imagination once, came crawling back, and there hasn't been any news of them abandoning the renewed licensing deal, has there?
Apple did ditch Imagination in 2017, following their 2014 "multi-year deal". In early 2020 Apple and Imagination signed another "multi-year deal", but it's not known for how long. It seems inevitable that this latter deal is on borrowed time; how long the situation continues presumably comes down to how confident Apple's IP lawyers are that their current/next GPU tech is sufficiently different in implementation from what they've previously licensed from Imagination.

If Imagination are actually providing technology to Apple, rather than this deal being one of convenience to stave off litigation, then I've not read anything about it.
 
Here is a quick rundown of what changed in the GPU of the M3 series and A17 Pro.

Explore GPU advancements in M3 and A17 Pro.

  • Dynamic register memory improves occupancy
  • Large on-chip cache available to all memory types
  • Occupancy changes dynamically with memory usage
  • FP16, FP32, and integer operations can execute in parallel
That's besides hardware-accelerated ray tracing, hardware-accelerated mesh shading and Dynamic Caching.
Interesting that Apple unifies the (main?) register file, local memory and data cache into the same scratch memory pool.
Quite similar to what Nvidia, back in 2010, thought their 2018 GPU could look like (the Echelon vision).
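
For illustration, here's a minimal, hypothetical Metal kernel (my own sketch, not from Apple's talk) that touches all three kinds of on-chip storage that reportedly now come out of one pool on M3: registers, threadgroup memory, and the data cache servicing the device reads.

Code:
#include <metal_stdlib>
using namespace metal;

// Hypothetical kernel. On earlier chips the register file and threadgroup
// memory are separate fixed-size structures; on M3 these allocations (plus
// the cache behind the device reads) reportedly share one on-chip pool, so
// occupancy can adapt to what the kernel actually uses.
// Assumes a threadgroup size of 32.
kernel void pooled_storage(device const float* in   [[buffer(0)]],
                           device float*       out  [[buffer(1)]],
                           threadgroup float*  tile [[threadgroup(0)]],
                           uint lid [[thread_position_in_threadgroup]],
                           uint gid [[thread_position_in_grid]])
{
    float acc = 0.0f;                     // register allocation
    tile[lid] = in[gid];                  // threadgroup memory, filled through the cache
    threadgroup_barrier(mem_flags::mem_threadgroup);
    for (uint i = 0; i < 32; ++i)
        acc += tile[(lid + i) & 31u];     // wrap within the 32-entry tile
    out[gid] = acc;
}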
 
What is curious is that on the one hand they say to prefer FP16 wherever possible (to reduce bandwidth/space requirements), but there is no mention of the FP32 pipeline being able to perform FP16 operations, only that "conversion is free". Thus you might get peak throughput only with a 50/50 mix of FP16 and FP32 operations, which would contradict the recommendation.
 
What is curious is that on the one hand they say to prefer FP16 wherever possible (to reduce bandwidth/space requirements), but there is no mention of the FP32 pipeline being able to perform FP16 operations, only that "conversion is free". Thus you might get peak throughput only with a 50/50 mix of FP16 and FP32 operations, which would contradict the recommendation.
That isn’t necessarily a contradiction. You could attain better performance through better occupancy, memory access pattern and lower power usage enabled by FP16, despite leaving some ALU throughput unused.

It is also possible that the FP32 ALU pipeline is, e.g., half as wide as the hardware SIMDgroup, so an FP32 instruction takes twice the cycles to execute, which would explain FP32 being 50% of the peak FP16 throughput on M1 (?).
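
If someone wanted to poke at this, a microbenchmark along these lines (my own untested sketch, nothing from Apple's docs) could show it: if the FP16 and FP32 pipes really issue in parallel, the interleaved loop below should run at roughly the combined rate of the two chains rather than at the rate of either one alone.

Code:
#include <metal_stdlib>
using namespace metal;

// Hypothetical dual-issue probe: two long, independent FMA dependency
// chains, one in half precision and one in single precision. Time this
// against variants running either chain alone.
kernel void mixed_fma_probe(device float* out [[buffer(0)]],
                            uint tid [[thread_position_in_grid]])
{
    half  h = half(tid & 0xFFu) * 0.5h;
    float f = float(tid) * 0.5f;
    for (uint i = 0; i < 1024; ++i) {
        h = fma(h, 0.999h, 0.001h);   // FP16 pipe
        f = fma(f, 0.999f, 0.001f);   // FP32 pipe
    }
    out[tid] = float(h) + f;          // keep both chains live
}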
 
That isn’t necessarily a contradiction. You could attain better performance through better occupancy, memory access pattern and lower power usage enabled by FP16, despite leaving some ALU throughput unused.
I can believe that in some cases, but I can absolutely max out the FP16 ALU on M1/M2 without being limited by occupancy, memory, or power. Given the unification of register file and threadgroup memory in M3, occupancy should be much less of a problem, and M series GPUs have way more memory bandwidth/flop than, say, Nvidia's tensor core GPUs. A doubling of ALU throughput would be more than welcome.

It is also possible that the FP32 ALU pipeline is, e.g., half as wide as the hardware SIMDgroup, so an FP32 instruction takes twice the cycles to execute, which would explain FP32 being 50% of the peak FP16 throughput on M1 (?).
Peak throughput with FP16 and FP32 is identical on M1, and you can achieve very close to the theoretical peak with a kernel that does a lot of simdgroup_multiply_accumulate and very little memory access (which isn't particularly useful in practice). For practical use (i.e. a real matmul) you can get ~90% ALU utilisation using FP16 and ~80% with FP32, if I remember correctly. It's been a while since I've worked on Metal kernels, but the difference between FP16 and FP32 on M1 isn't anywhere close to a doubling.
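
For anyone who hasn't written one, the kind of ALU-bound kernel being described looks roughly like this (a hypothetical sketch, not the poster's actual code): load two 8x8 tiles once, then issue a long run of simdgroup matrix FMAs with almost no memory traffic.

Code:
#include <metal_stdlib>
using namespace metal;

// Hypothetical ALU-saturation kernel. Not a useful matmul; it just keeps
// the FP16 pipeline busy with minimal memory access.
kernel void simd_mma_burn(device const half* A [[buffer(0)]],
                          device const half* B [[buffer(1)]],
                          device half*       C [[buffer(2)]])
{
    simdgroup_half8x8 a, b;
    simdgroup_half8x8 acc = make_filled_simdgroup_matrix<half, 8, 8>(0.0h);

    simdgroup_load(a, A, 8);   // 8 elements per row
    simdgroup_load(b, B, 8);

    for (uint i = 0; i < 4096; ++i)
        simdgroup_multiply_accumulate(acc, a, b, acc);

    simdgroup_store(acc, C, 8);
}

Swapping half for float (simdgroup_float8x8) gives the FP32 comparison point.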
 
DF compared the RTX 2060 to the M1 Max in the 16-inch MacBook Pro (32-core GPU); both delivered comparable performance at 4K in Death Stranding, which uses a native Apple implementation.


Also for reference, the Mac Studio with M1 Max (32-core GPU) delivers the same performance at native 4K, if anyone is worried about thermal throttling affecting the results.

 
That's a pretty solid result, all things considered. I know the M1s are also considered a great "less expensive" way to get LLMs completely into a high-bandwidth memory working set, versus trying to source an add-in video card with 24GB or more of VRAM.
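
To put rough numbers on that: a 70B-parameter model quantized to 4 bits needs about 70 × 10⁹ × 0.5 bytes ≈ 35 GB for the weights alone, which already exceeds any 24 GB add-in card but fits comfortably in the 64 GB of unified memory on an M1 Max.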
 