Are cutting-edge GPU technologies memory bound?

There was never any magic solution to RT or AI other than "more memory performance." More fixed-function or other specialized HW logic alone won't net you major gains ...

Hmmmm, I wonder why Microsoft wants an entire nuke power plant to run GPUs if they are just sitting around underutilized waiting on memory accesses. It would seem to make more sense to load up on 3070s instead of $15K-a-pop H100s, no?
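For what it's worth, a minimal roofline-style back-of-envelope sketch (Python; the hardware specs are approximate published numbers and the kernel intensity is a made-up illustrative value) shows how both posts can be right: the same GPU can be memory-bound on a low-intensity kernel and still be worth buying for its compute and bandwidth headroom.

```python
# Back-of-envelope roofline check: is a given kernel memory-bound or compute-bound?
# Specs are approximate published numbers, used purely for illustration.

GPUS = {
    # name: (peak FLOP/s, memory bandwidth in bytes/s)
    "H100 SXM (BF16)": (989e12, 3.35e12),   # ~989 TFLOPS, ~3.35 TB/s HBM3
    "RTX 3070 (FP32)": (20e12, 0.448e12),   # ~20 TFLOPS, ~448 GB/s GDDR6
}

def ridge_point(flops, bw):
    """Arithmetic intensity (FLOP/byte) above which compute, not memory, is the limit."""
    return flops / bw

# A hypothetical low-intensity kernel (e.g. a memory-heavy decode/attention step).
kernel_intensity = 60.0  # FLOP per byte moved -- illustrative assumption

for name, (flops, bw) in GPUS.items():
    ridge = ridge_point(flops, bw)
    bound = "memory-bound" if kernel_intensity < ridge else "compute-bound"
    attainable = min(flops, kernel_intensity * bw)  # roofline: capped by whichever limit binds
    print(f"{name}: ridge ~{ridge:.0f} FLOP/B -> {bound}, "
          f"attainable ~{attainable / 1e12:.0f} TFLOP/s")
```

With these illustrative numbers the H100 is memory-bound on that kernel yet still attains roughly an order of magnitude more throughput than the 3070, which is compute-bound on the same kernel; that is why "memory bound" and "worth the $15K" aren't mutually exclusive.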
 
If your devs earn high six figures and demand the easiest-to-work-with hardware, what are you gonna do? I suspect the Chinese have far better training/inference architectures while Americans paper over poor architecture with hardware.

HBM/NVLink (and the also-ran equivalents, Infinity Fabric etc.) are the SMP of the modern age.
 

Cool stuff!

Now, with diffusion replacing autoregressive word prediction, one can generate tokens ~10x faster for a single (batch-of-1) request, with fewer hallucinations and improved metrics as a bonus. Another win for more mathematically dense algorithms.

Autoregressive prediction always felt inefficient (algorithmically) since it's serial. Speculative multi-token prediction was a step in the right direction, but not a true paradigm shift, and now we finally have one.
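To make the serial-vs-parallel point concrete, here is a toy sketch (not any real model's API; the "models" are random stand-ins): autoregressive decoding needs one forward pass per generated token, while a masked-diffusion-style decoder refines a whole block over a fixed number of denoising steps, so the count of sequential passes no longer scales with the number of tokens.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]

def fake_next_token(context):
    """Stand-in for one autoregressive forward pass: predicts a single next token."""
    return random.choice(VOCAB)

def fake_block_model(prompt, block):
    """Stand-in for one diffusion forward pass: re-predicts every block position at once."""
    return [random.choice(VOCAB) for _ in block]

def autoregressive_decode(prompt, n_tokens):
    out = list(prompt)
    for _ in range(n_tokens):            # n_tokens strictly sequential model calls
        out.append(fake_next_token(out))
    return out

def diffusion_decode(prompt, n_tokens, n_steps=4):
    block = ["<mask>"] * n_tokens
    for _ in range(n_steps):             # n_steps sequential calls, independent of n_tokens
        block = fake_block_model(prompt, block)
    return list(prompt) + block

print(autoregressive_decode(["the"], 8))  # 8 sequential forward passes
print(diffusion_decode(["the"], 8))       # 4 sequential forward passes for the same 8 tokens
```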
 
Lower latency is nice, but total token throughput in batched inference for DeepSeek V3 is around 5000 tok/s with open-source software at the moment on H200/MI300x, with the code likely being far from optimal.

This method isn't necessarily more efficient overall if it can't benefit from the same gains from batching.
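A quick sketch with made-up numbers (only the 5000 tok/s figure comes from the post above; the per-request rate and the batching factor are pure assumptions) of why a per-request speedup doesn't automatically translate into batched throughput:

```python
# Hypothetical numbers, only to illustrate latency vs. throughput.
ar_single_stream = 60        # tok/s for one autoregressive request (assumed)
ar_batched_total = 5000      # tok/s across a full batch (figure quoted above)

diff_single_stream = 10 * ar_single_stream   # the claimed ~10x per-request speedup
diff_batch_scaling = 3       # assumed: how much batching still helps the diffusion decoder

diff_batched_total = diff_single_stream * diff_batch_scaling

print(f"AR:        {ar_single_stream} tok/s per request, {ar_batched_total} tok/s batched")
print(f"Diffusion: {diff_single_stream} tok/s per request, {diff_batched_total} tok/s batched (assumed)")
# Per-request latency clearly favors the diffusion decoder, yet total serving
# throughput can still favor autoregressive inference unless batching scales comparably.
```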
 