Very interesting results, and these are the first AI/LLM inference benchmarks I've seen for this hardware.
The M4 Max actually manages to outperform the RTX 4090 once models grow too large to fit in the 4090's 24 GB of VRAM.
That's what tensor parallelism is for in LLM inference: just plug in more RTX 3090s (you don't really need a 4090 at low batch sizes, since single-stream decoding is memory-bandwidth-bound rather than compute-bound, and the 3090's bandwidth is close to the 4090's). Even across plain PCIe, tensor parallelism works well enough.
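As a minimal sketch of what that looks like in practice, here's tensor-parallel inference with vLLM, one popular engine that supports it. The model name and tensor_parallel_size=4 are illustrative assumptions; pick values that match your GPU count and total VRAM.

    from vllm import LLM, SamplingParams

    # Shard the model's weight matrices across 4 GPUs: each GPU holds roughly
    # 1/4 of the weights, and activations are exchanged between GPUs (over
    # PCIe or NVLink) at every layer.
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model choice
        tensor_parallel_size=4,                     # number of GPUs to shard across
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
    print(outputs[0].outputs[0].text)

At batch size 1 the inter-GPU traffic per token is small (just the activations at each layer boundary), which is why this tolerates PCIe reasonably well.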
Though unless it's for something you can't send to a third party (porn being the obvious example), you can just rent GPUs in the cloud instead.