I'm more intrigued by Jim Keller's original comment that CUDA is "not beautiful".
This might be true in an absolute sense, but relative to all of the alternatives for HPC, I find CUDA a *lot* more beautiful and powerful than everything else I've looked into... part of the reason it has been successful overall (rather than in AI specifically) is that the alternatives have consistently been worse: partly as a result of needing to be multi-vendor, partly due to lack of vision and bad luck, and partly because CUDA did make a few very good choices. I'm including OpenCL/HSA/OneAPI/SYCL/Vulkan Compute/etc. and directly programming multicore SIMD on x86/ARM in that bucket (I haven't looked into ROCm/HIP enough to have a strong opinion yet, but either way that is much more recent). I'm also honestly not sure what I would even change to make a "better CUDA for HPC", as most of what I'd have in mind would just be higher levels of abstraction built on top of what we already have.
For AI specifically, there's definitely an argument that CUDA is a weird level of abstraction, and that the moat isn't as strong as it seems. But CUDA wasn't originally built for AI specifically; it was built for HPC in general, and in that context I think they've done a fantastic job compared to literally everyone else in the industry.
---
I am also skeptical that OpenAI's Triton is a better level of abstraction for AI (it is now the code generator behind PyTorch's default torch.compile backend, by the way, so it matters to the whole industry, not just to OpenAI).
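For context on what "default backend" means in practice, here's a minimal sketch, assuming PyTorch 2.x on a CUDA device (the gelu_mlp function is just an illustration): torch.compile routes through TorchInductor, which code-generates Triton kernels for the fused element-wise work (and, in max-autotune mode, can generate Triton matmuls as well).

```python
import torch

def gelu_mlp(x, w1, w2):
    # two matmuls with a GELU in between -- a typical fusion target
    return torch.nn.functional.gelu(x @ w1) @ w2

# default backend is "inductor", which emits Triton kernels on CUDA devices
compiled = torch.compile(gelu_mlp)

x  = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
w1 = torch.randn(1024, 4096, device="cuda", dtype=torch.float16)
w2 = torch.randn(4096, 1024, device="cuda", dtype=torch.float16)
out = compiled(x, w1, w2)  # first call triggers Triton codegen + autotuning
```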
In NVIDIA's case, with the TMA accelerators in Hopper, they cannot efficiently support a "prologue" that preprocesses the inputs before a GEMM, because getting the data into shared memory is effectively fixed-function now. NVIDIA's CUTLASS has for many years only supported an epilogue (post-GEMM, e.g. applying an activation function like ReLU), so that is presumably the programming model their hardware team focused on. Previously, Triton tried to automatically detect cases where it could still use the TMA (I think that code was probably written by NVIDIA engineers), but it was messy/ineffective enough that they recently gave up on the approach: https://github.com/openai/triton/pull/3080
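To make the prologue/epilogue distinction concrete, here's a simplified Triton-style GEMM sketch, based on the standard tiled-matmul pattern (masking omitted, and no claim that this is how Triton or CUTLASS actually implement it):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_relu_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                       stride_am, stride_ak, stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                       BLOCK_K: tl.constexpr):
    # assumes M, N, K are multiples of the block sizes and C is fp32,
    # to keep the sketch short
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # "prologue" territory: any per-element transform of the A/B tiles
        # would have to happen here, on the load path that TMA wants to own
        # as a descriptor-driven bulk copy
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    # "epilogue": operates on the accumulator after the K-loop
    acc = tl.maximum(acc, 0.0)  # fused ReLU
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```

The epilogue line costs essentially nothing because it only touches the accumulator already sitting in registers; a prologue, by contrast, would need to hook into the loads inside the K-loop, which is precisely the step that TMA turns into a fixed-function copy on Hopper.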
If all you care about is "beauty", then maybe TMAs being fixed-function-ish is bad, Hopper's more complicated programming model is bad, and Triton is the right level of abstraction. But it turns out that when you're buying >$10 billion worth of GPUs, like Meta is, you tend to care a little more about making good use of them than about beautiful abstractions: https://github.com/pytorch/pytorch/issues/106991
And I don't think this is NVIDIA-specific: every hardware architecture is likely to end up leaving clever tricks on the table (or not bothering to implement them in HW because they don't think they could make use of them) if it sticks to these levels of abstraction. My personal opinion is that it probably makes sense to have an architecture that can easily reach decent efficiency for many use cases (important for algorithmic innovation) but also supports advanced optimisation to get a further >2x with ninja coding (important for deploying at OpenAI/Meta/Google scale). I'd argue modern GPUs are actually pretty good from that perspective.