NVIDIA discussion [2024]

I'm more intrigued by Jim Keller's original comment that CUDA is "not beautiful".

This might be true in an absolute sense, but relative to all of the alternatives for HPC, I find CUDA a *lot* more beautiful and powerful than everything else I've looked into... part of the reason it has been successful overall (rather than in AI specifically) is that the alternatives have consistently been worse, partly as a result of needing to be multi-vendor, partly due to lack of vision and bad luck, and partly because CUDA did make a few very good choices. I'm including OpenCL/HSA/OneAPI/SYCL/Vulkan Compute/etc. and directly programming multicore SIMD on x86/ARM in that bucket (I haven't looked into ROCm/HIP enough to have a strong opinion yet, but either way that is much more recent). I'm also honestly not sure what I would even change to make a "better CUDA for HPC", as most of what I'd have in mind would just be higher levels of abstraction built on top of what we already have.

For AI specifically, there's definitely an argument that CUDA is a weird level of abstraction, and that the moat isn't as strong as it seems. But CUDA wasn't originally built for AI specifically, it was built for HPC in general, and in that context I think they've done a fantastic job compared to literally everyone else in the industry.

---

I am also skeptical that OpenAI's Triton is a better level of abstraction for AI (it is now PyTorch's default GPU code-generation backend, by the way, so it's very important across the industry, not just for OpenAI).
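For context on what "default backend" means in practice: torch.compile lowers models through TorchInductor, which generates Triton kernels for the GPU parts. A minimal sketch (requires a CUDA GPU and a recent PyTorch 2.x install):
Code:
import torch

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
compiled = torch.compile(model)  # TorchInductor is the default backend; it emits Triton kernels for GPU ops
out = compiled(torch.randn(8, 1024, device="cuda"))  # first call triggers compilation/codegen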

In NVIDIA's case, with the TMA accelerators in Hopper, they cannot efficiently support a "prologue" that does preprocessing of the inputs before a GEMM, because getting the data into shared memory is effectively fixed-function now. NVIDIA's CUTLASS has for many years only supported an epilogue (post-GEMM, e.g. applying an activation function like ReLU), so presumably that's the programming model their hardware team focused on. Previously, Triton tried to automatically detect cases where it could still use the TMA (I think that code was probably written by NVIDIA engineers), but it was messy/ineffective enough that they just gave up on that approach recently: https://github.com/openai/triton/pull/3080
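To make the prologue/epilogue distinction concrete, here's a minimal Triton-style matmul sketch (my own illustration, not CUTLASS code; it assumes M/N/K are exact multiples of the block sizes and an fp32 output, so there's no masking). The ReLU applied after the loop is an "epilogue"; any element-wise work on the a/b tiles *inside* the loop, before tl.dot, would be a "prologue" and is what interacts badly with fixed-function data movement:
Code:
import triton
import triton.language as tl

@triton.jit
def matmul_relu_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                       stride_am, stride_ak, stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)   # plain tile loads: easy for the compiler to lower to async copies/TMA
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)   # main GEMM loop, no element-wise work on the tiles (no prologue)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    acc = tl.maximum(acc, 0.0)  # epilogue: ReLU on the accumulator, after the matmul
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)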

If all you care about is "beauty" then maybe TMAs being fixed-function-ish is bad, and Hopper's more complicated programming model is bad, and Triton is the right level of abstraction. But it turns out that when you're buying >$10 billion worth of GPUs like Meta is, you tend to care a bit more about making good use of them than about beautiful abstractions: https://github.com/pytorch/pytorch/issues/106991

And I don't think this is NVIDIA-specific, every hardware architecture is likely to end up leaving clever tricks on the table (or potentially not bother implementing them in HW because they don't think they could make use of them) if they stick to these levels of abstraction. My personal opinion is it probably makes sense to have an architecture that can easily reach decent efficiency for many use cases (important for algorithmic innovation) but also supports advanced optimisation to get a further >2x with ninja coding (important for deploying at OpenAI/Meta/Google scale). I'd argue modern GPUs are actually pretty good from that perspective.
 
In NVIDIA's case, with the TMA accelerators in Hopper, they cannot efficiently support a "prologue" that does preprocessing of the inputs before a GEMM, because getting the data into shared memory is effectively fixed-function now. NVIDIA's CUTLASS has for many years only supported an epilogue (post-GEMM, e.g. applying an activation function like ReLU), so presumably that's the programming model their hardware team focused on. Previously, Triton tried to automatically detect cases where it could still use the TMA (I think that code was probably written by NVIDIA engineers), but it was messy/ineffective enough that they just gave up on that approach recently: https://github.com/openai/triton/pull/3080
Could you expand on this a bit? Are those prologues for decompression, including things like sparse matrix expansion, non-native datatypes, shared exponents/scale factors? I'd imagine that if a prologue just performed element-wise ops, like epilogues usually do (which can be frustratingly limited at times) it would fit in just fine with TMA use.
 
Could you expand on this a bit? Are those prologues for decompression, including things like sparse matrix expansion, non-native datatypes, shared exponents/scale factors? I'd imagine that if a prologue just performed element-wise ops, like epilogues usually do (which can be frustratingly limited at times) it would fit in just fine with TMA use.
I'm not sure what prologues are typically used for in Triton or how common they are in practice, to be honest - just that the programming model allows you to do whatever you want: the GEMM is just a "tl.dot(matrixA tile, matrixB tile)" operation that can be put wherever you want in the program, with tl.load and other element-wise operations interleaved as you wish. This obviously makes the compiler's job harder if "tl.load" must be heavily special-cased, as is the case with TMA+Warp Specialisation.

I haven't looked at this example in any depth at all (it's just the first thing I could find that seemed interesting enough), but here's a Triton inner loop I found in a random tutorial related to optimising Llama 2: "Deep Dive into Kernel Fusion: Accelerating Inference in Llama V2"
Code:
for _ in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
    x = tl.load(x_ptrs)
    x_sum += tl.math.pow(x.to(tl.float32), 2)  # RMSNorm stat computation
    rms_w = tl.load(rms_w_ptrs)
    x = x * rms_w  # RMS weight multiplication
    w = tl.load(w_ptrs)
    accumulator += tl.dot(x, w)  # matmul between input and linear weight (QKV projection)
    x_ptrs += BLOCK_SIZE_K * stride_x_k  # next input blocks by increasing the pointer
    w_ptrs += BLOCK_SIZE_K * stride_w_k
    rms_w_ptrs += BLOCK_SIZE_K * stride_rms_w
x_mean = tl.sum(x_sum, axis=1) / K + EPS
x_norm = tl.math.rsqrt(x_mean)
accumulator = accumulator * x_norm[:, None]  # applies RMSNorm on the output of the matmul

The "x" input to tl.dot(x,w) is multiplied by rms_w before being used in tl.dot(), *and* that same x input is added (after a pow()) to x_sum which is only used after the loop. Don't read too much into this, I have no idea how common/important this specific example is, but it's a good demonstration of why that generality can be useful.

As for why TMA operations are problematic here - there's actually 3 aspects to this:
  1. Asynchronous loads can only write to shared memory rather than registers (including both TMA in Hopper and the asynchronous copies (cp.async) which were added in Ampere).
  2. NVIDIA recommends implementing TMA as part of a producer-consumer model with warp specialisation, where one warp in the workgroup loads tensor data into shared memory (the producer) while the other warps consume ready data from shared memory (the consumers).
  3. Unlike Ampere tensor cores, Hopper actually *needs* at least one of the inputs to come from shared memory for maximum performance (and it probably does better when both are?), while Ampere tensor cores could only read/write the main register file (Hopper still always writes tensor core outputs to the register file).
In theory, I'm not sure why TMA couldn't be used without warp specialisation, which would make it a lot easier to implement (you've still got the somewhat orthogonal problem of needing to define a layout ("tensor map") to feed to the TMA instruction, ideally in host code or alternatively in the kernel itself, which apparently caused other engineering problems in Triton). It probably wouldn't allow you to hide latency as effectively, and maybe you'd need to do more, smaller TMAs to compensate, but overall it feels like it'd have similar upsides/downsides to the asynchronous copies from Ampere which are already used by default (and presumably require a bit of compiler magic, but a lot less than warp specialisation).

Compared to Ampere, you're still in the unfortunate situation where the data is in shared memory, and doing element-wise operations on it before the GEMM requires at a minimum that you do:
  1. Asynchronous copy from global memory to shared memory (aka local memory in OpenCL, per-workgroup scratch) --> 1 RAM write to shared memory (also needed without prologue)
  2. Read from shared memory and write to register (1 RAM read from shared + 1 RAM write to registers).
  3. Read from registers, do element-wise operations, write to registers (1 RAM read from registers + 1 RAM write to registers)
  4. Read from registers and write to shared memory (1 RAM read from registers, 1 RAM write to shared memory).
  5. Read from shared memory sending data to tensor cores (1 RAM read from shared memory, also needed without prologue).
So you've added a strict minimum of 6 RAM operations (2 on shared memory, 4 on registers) which is a lot less elegant/efficient than just streaming data straight from global memory to shared memory to tensor cores (bypassing the register file completely). Given that the tensor core peak performance has doubled but shared memory hasn't, I suspect this will start hurting performance, although maybe it's OK if you only need to do this for 1 of the 2 input tensors, I'm not sure.

From a hardware perspective, it's a little bit weird that we ended up using shared memory for this, because shared memory seems massively more flexible than it needs to be in terms of banking, with what looks to be much deeper but less wide (and therefore less power efficient) RAM banks...

In theory, Hopper's shared/L1 RAMs should be 32 banks * 32-bit data * 2048-deep, and the register file should be 32 banks (8 per SM sub-partition) * 256-bit data * 256-deep. In my experience, I'd expect >>2x power efficiency per bit read/written for the latter compared to the former, because wide/shallow is so much more power efficient than deep/narrow (deeper is more area efficient, but fewer, wider banks would be even better in terms of both power and area efficiency; that wouldn't match how shared memory is expected to work with 32-bit accesses though). In practice maybe NVIDIA is splitting their shared memory banks for power efficiency or doing something clever with custom RAM macros (not sure if they could improve power efficiency in the case where accesses are consecutive and make it work more like a smaller number of wider banks somehow? that'd be a pure area efficiency loss, but the power benefit might be worth it - if they are doing that, it'd presumably be possible to measure it with sufficiently advanced microbenchmarks).
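As a sanity check on those figures (my arithmetic, assuming the geometries above are right), both come out to the same 256 KiB per SM; they just trade width against depth:
Code:
smem_bits = 32 * 32 * 2048   # 32 banks * 32-bit data * 2048 entries deep (shared/L1)
rf_bits   = 32 * 256 * 256   # 32 banks * 256-bit data * 256 entries deep (register file)
print(smem_bits / 8 / 1024)  # 256.0 KiB of shared/L1 per SM
print(rf_bits / 8 / 1024)    # 256.0 KiB of registers per SM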

Anyway, back to the original point - I think Warp Specialisation with a producer/consumer model is the really horrible thing to do for the general case in something like Triton, and implementing a prologue otherwise is probably OK-ish; it just prevents you from getting a lot of the benefit of Hopper tensor cores reading straight from shared memory, etc... It's not clear how beneficial TMA is compared to plain async loads if you're not using a producer-consumer model, as I haven't seen any code that even attempts to do that in practice, but I don't see why it wouldn't work in theory - it might just not provide as much benefit.

Actually, I wouldn't be that surprised if the biggest benefit of TMA in CUTLASS was that it allows you to do a "multicast" from global memory to the shared memory of multiple workgroups in the same "cluster group" (the new per-GPC mega-workgroup construct added in Hopper) which should result in better cache and power efficiency (data is only sent from L2 to GPC once, resulting in fewer L2 reads and no risk of L2 misses, and less wire energy moving data around compared to sending it to each SM individually which would go through most of the distance of the GPC multiple times).

Most of the functionality I'm describing is PTX-only and not even exposed in CUDA itself by the way (e.g.: TMA) - and much of it is "sm_90a" which is the "Hopper only profile, no forward compatibility guarantee" so NVIDIA is clearly looking at alternative ways to do all this. That makes sense as they expect most developers to rely on CUTLASS/cuBLAS anyway (while Triton is generating its own PTX). Assuming Blackwell is a completely new shader core ala Volta, it'll be very interesting to see which direction they take!
 
Anecdotal but here’s an example of specialized AI crunching hardware that can’t seem to find a foothold in the market.

It’s a good bet the only real challenge to Nvidia’s dominant position will come from AMD/Intel or the big cloud providers rolling their own chips and the latter isn’t guaranteed. Smaller players will struggle to break through even with better specs on paper.

 
February 21, 2024
NVIDIA, in collaboration with Google, today launched optimizations across all NVIDIA AI platforms for Gemma — Google’s state-of-the-art new lightweight 2 billion– and 7 billion-parameter open language models that can be run anywhere, reducing costs and speeding innovative work for domain-specific use cases.

Teams from the companies worked closely together to accelerate the performance of Gemma — built from the same research and technology used to create the Gemini models — with NVIDIA TensorRT-LLM, an open-source library for optimizing large language model inference, when running on NVIDIA GPUs in the data center, in the cloud and on PCs with NVIDIA RTX GPUs.
...
Adding support for Gemma soon is Chat with RTX, an NVIDIA tech demo that uses retrieval-augmented generation and TensorRT-LLM software to give users generative AI capabilities on their local, RTX-powered Windows PCs. The Chat with RTX lets users personalize a chatbot with their own data by easily connecting local files on a PC to a large language model.
 
Anecdotal but here’s an example of specialized AI crunching hardware that can’t seem to find a foothold in the market.

It’s a good bet the only real challenge to Nvidia’s dominant position will come from AMD/Intel or the big cloud providers rolling their own chips and the latter isn’t guaranteed. Smaller players will struggle to break through even with better specs on paper.

Heh, I actually live about 30 seconds from Graphcore's Bristol offices at the moment. I haven't talked with anyone working there since I interviewed in 2017 though (I talked with Simon Knowles in 2017 and a few times in the Icera era while I was writing articles for Beyond3D; I was meant to write a longer-form architecture article about their programmable baseband architecture, but ended up only writing news pieces/industry analysis/long forum posts about them unfortunately), but I did hear a few things through the grapevine over the years.

I should probably poke them sometime (maybe after I finish the Hopper microarchitecture article I'm currently working on...), because I'm still confused how they haven't managed to replicate Cerebras or Groq's successes in training and inference respectively, as their architecture trade-offs aren't so fundamentally different.

For context, Cerebras/Groq/Graphcore 1st Gen all kept 100% of the model weights in SRAM, so the size of the model you could support was proportional to the number of chips that were connected together (where in Cerebras' case the "minimum" number of chips was very high since it's wafer-scale, while Graphcore/Groq started at one chip and scaled up to 100s).

That obviously doesn't work for LLMs unless you have a *LOT* of chips, which is exactly what Groq has now demonstrated for inference: 576 chips * 230MiB of SRAM = ~129GiB of SRAM, which is enough for Llama2 or Mistral with FP8. Graphcore did support up to 256 chips with Mk2, so 230GiB of SRAM, and they supported streaming from DRAM much more slowly (with FPGAs apparently... not ideal obviously). I don't see why they couldn't have done the same thing... although possibly Groq's approach is more suitable to having a "pipeline" between chips; that's mostly a software problem though ("just" software... aka the bane of every [AI] HW startup ever). Semianalysis got a cost analysis of Groq's 14nm chip here: https://www.semianalysis.com/p/groq-inference-tokenomics-speed-but
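Back-of-envelope check on those capacities (my arithmetic, using the per-chip figure quoted above and a rough ~70GB for Llama 2 70B weights at FP8):
Code:
groq_total_gib = 576 * 230 / 1024        # 576 chips * 230 MiB of SRAM each
print(groq_total_gib)                    # ~129.4 GiB across the deployment
llama2_70b_fp8_gib = 70e9 / 2**30        # ~70B parameters at 1 byte each
print(llama2_70b_fp8_gib)                # ~65 GiB of weights, so it fits with headroom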

On the other end, you've got Cerebras, which pivoted (on the same silicon) from "100% of model weights in SRAM, activations/gradients streaming through, batch size of 1" to "0% of model weights in SRAM, activations/gradients stored in SRAM, insanely large batch size to amortise streaming weights". Cerebras actually has *really* bad bandwidth between their wafer-scale processor and the rest of the system, so they can't stream weights very fast, which means it must be amortised over a huge batch. I'm honestly surprised their training runs work so well with such a large batch size (they released public LLM models which were very competitive at the time), but it seems to work - I'm not sure whether that batch size will necessarily be as efficient for training every kind of network...

Finally, there's also Tenstorrent, which is a hybrid "quite a bit of SRAM, fairly high DRAM bandwidth for streaming weights", where they effectively save ~all of the activation DRAM bandwidth but not much of the weights bandwidth. Jim Keller is obviously extremely smart, but I find the lack of detailed information on activation vs weight bandwidth and the impact of batch sizes in his presentations a bit misleading: if you can get away with large batch sizes, then you might be able to save more weights DRAM bandwidth than they can save in activation DRAM bandwidth...
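To illustrate that trade-off, here's a toy model (entirely my own back-of-envelope, not from any vendor material): per-token DRAM traffic for a single d_in x d_out linear layer, comparing weights streamed from DRAM (amortised over the batch) against activations that spill to DRAM (a fixed cost per token):
Code:
def dram_traffic_per_token(d_in, d_out, batch, bytes_per_elem=1):
    weight_bytes = d_in * d_out * bytes_per_elem   # read once per batch if streamed
    act_bytes = (d_in + d_out) * bytes_per_elem    # per token, if activations aren't kept on-chip
    return weight_bytes / batch, act_bytes

for batch in (1, 32, 1024):
    w, a = dram_traffic_per_token(8192, 8192, batch)
    print(f"batch={batch}: weights {w/2**10:.0f} KiB/token, activations {a/2**10:.0f} KiB/token")

At batch=1 the streamed weights dominate completely (64 MiB vs 16 KiB per token); at batch=1024 they're down to 64 KiB/token, which is the point about large batches amortising weight bandwidth.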

It feels to me like Graphcore's HW architecture should be flexible enough to support all of these different approaches with reasonable efficiency, so their apparent inability to do so is quite disappointing... Possibly it's issues with interconnects/scaling (and they had to lay off their team doing the large-scale server scaling, so that's gone now anyway), possibly it's "just software" while they had too much of a HW-centric approach. Or maybe they decided to pivot to a quite different architecture/approach for their next-gen after Mk2, which would reduce the benefit of doing all that throwaway SW work, and Mk3 seems very late since Mk2 came out in early 2021, which puts them in an awkward position...

I have some theories of what went wrong technically & culturally in more detail, but I don't think it'd be appropriate for me to write them here, and I still have a lot of respect for Simon either way - it's a tough market. He joked to me in 2017 that he knew he'd probably make more money keeping his NVIDIA shares (from the Icera acquisition) than doing another startup, but he'd have more fun doing a startup, so that's what he was doing. I'm not sure how he'd feel about it today given NVIDIA's meteoric stock price increase, but I can respect that :)
 
Anecdotal but here’s an example of specialized AI crunching hardware that can’t seem to find a foothold in the market.

It’s a good bet the only real challenge to Nvidia’s dominant position will come from AMD/Intel or the big cloud providers rolling their own chips and the latter isn’t guaranteed. Smaller players will struggle to break through even with better specs on paper.

More to the point... on whether anyone but AMD/Intel has any chance to gain non-negligible market share, in my opinion:
  • Cerebras might win a decent niche for training, but depends how competitive their next-gen is vs NVIDIA Blackwell, especially as NV will no longer be supply limited soon (main reason both they & AMD have been so successful last year).
  • Qualcomm has a chance in inference; Cloud AI 100 has been really good in MLPerf for years, but they haven't managed to translate that into sales. I think 100 Ultra is a very sensible refresh (same arch but bigger so large LLMs fit on a single device; I suspect a move from TSMC N7 to N6, which might be the perf/$ sweet spot for such a SRAM/DRAM-PHY-heavy chip).
  • Groq might carve itself a niche especially in "inference as a service", their architecture feels extremely efficient for models they can completely fit in SRAM, their 1D "pipeline" approach is clever and maybe more efficient than e.g. Tenstorrent/Graphcore's 2D grid. But they're stuck on their original 14nm Globalfoundries chip from 2019 which is very outdated, and it's likely models will get larger and less suitable to their architecture, so time is really not on their side... not sure their architecture is suitable to benefit from dynamic sparsity either.
  • Tenstorrent: Practically no chance in datacenter IMO, but I can see them being very successful as an IP provider, e.g. integrated into SoCs for various devices like TVs/cameras/etc. or integrated as a chiplet next to those SoCs (while probably also being successful licensing their RISC-V IP/chiplets).
I'm pretty optimistic about AMD's chances (MI300X doesn't feel cost competitive with Hopper but NVIDIA's premium is insane so they could still easily be price competitive). On the other hand, I feel like Intel's AI efforts are just a never-ending tragedy of "too little too late"... Gaudi is a decent architecture, but Gaudi3 just feels too incremental and too late. I'm sure they'll get some customers and make a decent amount of money from it, but I expect MI300X will dominate Gaudi3 in terms of revenue. And then after Gaudi3, they're scrapping the entire architecture and moving their AI accelerators to being GPU-based, which creates a lot of risk, but hopefully they are successful with it.

Overall I completely agree it's hard to imagine any of them (except maybe AMD & proprietary chips from MS/AWS/Google) will be a real challenge to NVIDIA. But it's a huge market, so they could still be successful in absolute terms.
 
NVIDIA made 22 billion in Q4 2023; the highest Intel has ever earned in one quarter is 20.

Guidance for Q1 2024 is 24 billion, up by 2 billion Q/Q.
Incredible quarter and year thanks to AI boom, but...
You mean Q4/FY24 and Q1/FY25 (NVIDIA doesn't do calendar quarters), and they made 12.3 billion, not 22 billion; 22 billion is revenue (same goes for Intel's 20.5 billion record quarter).
 
Judging from the drop on the previous trading day and the trading day of the report (the report was released after the market closed), it looks like the market was already worrying that the boom is not sustainable (the stock dropped ~7% during the two days). After the report was released, after-market trading went up by 9%. Interestingly, AMD dropped a lot too, though a bit less, and its after-market trading also rebounded.

I guess this probably means that the market was (is?) skeptical that the AI boom could continue. On the other hand, from what I've heard AI chips are still in high demand and lead times for multi-GPU AI machines are still very long. The key question is whether this is going to be like the smartphone industry (and the mobile phone industry before that), or whether we will see another "AI winter" (like countless others before).

A side note: while NVIDIA's PE ratio is in the crazy region (~90) for now, if NVIDIA can keep up with this $5 EPS per quarter for the next year, it would be a very reasonable PE ratio of ~35 at its current price (a similar level to Microsoft).
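The arithmetic behind that ~35 figure (my own back-of-envelope, just restating the post's numbers):
Code:
eps_per_quarter = 5.0
annual_eps = 4 * eps_per_quarter          # $20/year if the run-rate holds
implied_price_at_pe_35 = 35 * annual_eps
print(implied_price_at_pe_35)             # $700 - roughly the share price implied by a ~35 forward P/E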
 
Judging from the drop on the previous trading day and the trading day of the report (the report was released after the market closed), it looks like the market was already worrying that the boom is not sustainable (the stock dropped ~7% during the two days). After the report was released, after-market trading went up by 9%. Interestingly, AMD dropped a lot too, though a bit less, and its after-market trading also rebounded.

I guess this probably means that the market was (is?) skeptical that the AI boom could continue. On the other hand, from what I've heard AI chips are still in high demand and lead times for multi-GPU AI machines are still very long. The key question is whether this is going to be like the smartphone industry (and the mobile phone industry before that), or whether we will see another "AI winter" (like countless others before).

A side note: while NVIDIA's PE ratio is in the crazy region (~90) for now, if NVIDIA can keep up with this $5 EPS per quarter for the next year, it would be a very reasonable PE ratio of ~35 at its current price (a similar level to Microsoft).

Yeah forward PE doesn’t seem that expensive at all. And it’ll drop further given Q1 guidance.

The stock likely dropped for a few reasons. It’s been on a tear since the year started and investors took profits ahead of the uncertainty of earnings. Other people sold out of pure fear / incredulity that the run could continue based on how much the stock has run up recently. Recent reports of shorter lead times also spooked some folks into thinking supply was catching up to demand (which Nvidia confirmed was not the case on the call).

The forward Q1 guidance and hints that demand will continue to outstrip supply through the end of the year will likely keep the rally going. Hard to imagine it can continue much beyond that but it depends on sustained end consumer demand. It can happen if the demand drivers that Nvidia rattled off actually pan out (auto, health care, sovereign LLMs, pharma).

One interesting thing about this data center transition to compute accelerators is that it seems the hardware is actually being put through the wringer. My guess is that a majority of the classic CPU based compute capacity at CSPs today is sitting idle due to low demand or due to running lightweight software.
 
Hard to imagine it can continue much beyond that but it depends on sustained end consumer demand. It can happen if the demand drivers that Nvidia rattled off actually pan out (auto, health care, sovereign LLMs, pharma).
Revenue from China has decreased significantly due to the US bans; however, competitors like Huawei are unable to fill the gap due to limited yields on their chips, so NVIDIA is back to supplying China with new cut-down chips, which will raise revenue from China in the coming quarters.

NVIDIA is also betting on India, Japan and several other nations (who want to build their own AI institutions) to keep demand high.
 
It took nVidia 18 years from G80 to Hopper, but only one quarter to show the world how much better accelerated computing is.
 
It took quite a bit longer than a quarter and wouldn’t have happened without the mass market awareness brought on by ChatGPT. But yes this was a long term play and they’re finally cashing in big time.
 
[Image: NVDA data center revenue chart, February 2024]

ChatGPT was the iPhone moment, but it took nVidia at least 10 years to make it happen.
 
Eh, these margins are from DC products though - which are partially "chiplet" based in the case of Nvidia too. That's even disregarding the whole AI boom, which is responsible for said margins. So this isn't any proof either.
The margins on a 4090 are certainly higher than on a 1080ti.
 