Speculation and Rumors: Nvidia Blackwell ...

I still don't think that makes a lot of sense if they're also moving to GDDR7, but I dunno. Maybe if they're doing away with the large L2 as a bandwidth supplement, it could make sense.
The answer is unironically data center.
You get 64/96GB of VRAM on the upper end clamshells that way (with 16/24Gb ICs that is).
 
The answer is unironically data center.
You get 64/96GB of VRAM on the upper end clamshells that way (with 16/24Gb ICs that is).

I can see it, Tenstorrent is already doing "cheaper" AI accelerators as a strategy. If you're doing cloud inference you need cheap first and foremost, and don't need 192GB of HBM like in training.

800W Blackwell incoming; would hardly be surprised at the same strategy from AMD as well. I'm sure some youtuber is going to buy one of these at $5K or so and show it running games for a gag, but at that point it's probably stretching "consumer" beyond credibility even for PR purposes.

I do wonder how long we'll see cloud inference being a "major" thing though. AI model makers are already starting to understand that cloud inferencing isn't very profitable, especially when consumers will willingly pay for edge devices that can run models themselves.
 
Nvidia already has a "cheap" datacenter accelerator with the PCIe L40S based on AD102. It goes for about $12K. Makes sense for the Blackwell successor to bump VRAM capacity for that market.
 
Tenstorrent is already doing "cheaper" AI accelerators as a strategy
who.
and don't need 192GB of HBM like in training.
You do.
Large VRAM is primarily for inference.
That's why MS is vacuuming ~most of MI300X supply for Copilot.
800W Blackwell incoming; would hardly be surprised at the same strategy from AMD as well.
a) it's not that powerhungry
b) yea lol Navi50 will be sold in that market too.
I do wonder how long we'll see cloud inference being a "major" thing though.
not going anywhere unless the sorta useful stuff people gonna pay money for (office copilot) evaporates.
especially when consumers will willingly pay for edge devices that can run models themselves.
Client devices have tiny DRAM amounts and nonexistent membw; they're doomed to never do anything useful wrt machine learning.
Makes sense for the Blackwell successor to bump VRAM capacity for that market.
and bandwidth!
 
Probably should just say "Jim Keller", everyone knows him

You do.

Large VRAM is primarily for inference.
99% of models get slimmed down for inference compared to training. True, you need less compute per MB for inference than for training, but that's still a point in GDDR's favor for inference. E.g. 96GB can hold current Mixtral (at 16-bit no less): less compute, less BW, less HBM. Actually, do you need that much BW for inference at all? Maybe DDR with the right models would be good enough, and that's like a fourth the cost of GDDR.
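
As a rough sanity check on that (the parameter counts below are the commonly cited Mixtral 8x7B figures, and the bandwidth values are just illustrative assumptions):

# Back-of-envelope for Mixtral 8x7B inference at batch size 1, fp16.
# Parameter counts and bandwidth figures are assumptions for illustration.
total_params = 46.7e9       # all experts have to sit in memory
active_params = 12.9e9      # only ~2 of 8 experts are read per generated token
bytes_per_param = 2         # fp16

weights_gb = total_params * bytes_per_param / 1e9     # ~93 GB -> fits in 96 GB
bytes_per_token = active_params * bytes_per_param     # ~26 GB read per token

for name, bw in [("~1 TB/s GDDR", 1000e9), ("~90 GB/s dual-channel DDR5", 90e9)]:
    print(f"{name}: ~{bw / bytes_per_token:.1f} tokens/s ceiling at batch 1")
# GDDR-class bandwidth gives a ceiling of ~38 tok/s here, DDR-class ~3.5 tok/s.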

a) it's not that powerhungry
b) yea lol Navi50 will be sold in that market too.
Yeah, I double-checked the math after posting; 600W way more likely, especially with that 28Gbps, oop

not going anywhere unless the sorta useful stuff people gonna pay money for (office copilot) evaporates.

Client devices have tiny DRAM amounts and nonexistent membw; they're doomed to never do anything useful wrt machine learning.
RAM is cheap and mini models are ramping up; Gemini Nano runs on phones already, as does real-time translation. "Networked this/that/everything" has been a dream of engineers for decades now. But if you can run it on edge, then edge ends up overtaking it for consumers. Cloud inference will always have customers. But we'll also watch phones and laptops double in RAM in the blink of an eye, and more inference will be run there than in the cloud, at least in terms of number of users.

Once "OpenOffice Brain" or whatever that can run in 10gb of ram comes out as "good enough" that'll be a lot more customers than Office365.
 
So, I have some experience now with hosting LLMs at home. Training, distilling, and quantizing will all consume a pretty serious chunk of memory bandwidth. As for inference though, it's far more about compute and low latency / direct access to memory. CPU inference is slow mostly due to the CPU itself; GPU inference backed by main memory (i.e. not VRAM) on an x86 platform gets ugly because the latency from GPU to CPU (memory) is atrocious.

Some of the work I've been doing is training a Mistral 7B LLM on home automation things, and then distilling / pruning stuff I don't need / won't use for said home automation. I also end up quantizing it down to like six or even four bits, seeing what I can do to get it all wedged into the 12GB of VRAM on my 3080Ti while keeping my target capabilities. I'm still not good at it yet :) Nevertheless, I have done some playing with hardware speeds. If I leave all the clocks at my 3080Ti's max undervolt (1695 core @ 750mV / 16500 memory) I can get upwards of 30 tokens/sec depending on what I've done to the language model. If I reduce memory clocks by 70% (5002MHz) the token rate is only barely affected; maybe it will lose 10%, and sometimes it's not even measurable. However, if I crank the GPU clock down by about 55% (810MHz) the token rate drops by half as well.
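
For reference, a minimal tokens/sec timing sketch looks something like this (llama-cpp-python here is an assumption, and the model filename and prompt are just placeholders, not necessarily what I'm actually running):

# Minimal tokens/sec measurement sketch (llama-cpp-python assumed).
import time
from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-home-q4.gguf",  # hypothetical 4-bit quantized model
            n_gpu_layers=-1,                       # offload all layers to the GPU
            n_ctx=2048, verbose=False)

start = time.perf_counter()
out = llm("Turn off the living room lights and arm the alarm.", max_tokens=256)
elapsed = time.perf_counter() - start

tokens = out["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tokens/s")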

"Slow" commodity GPUs with big memory pools have become quite sought after for exactly this reason; they're quickly being gobbled up by LLM enthusiasts and driving the prices up. Further to this point, this is also why the original Apple M1's (not even the 'big' SKUs) make great inference devices if you get them with 32GB of ram. Apple silicon does not use separate memory pools for CPU vs GPU, so main memory is shared and accessible by both. The Apple M1's (standard sku) memory bandwidth is less than a well equipped i9-13900k can support, however the GPU compute in the M1 combined with the singular global pool of memory means the latency kept very low and the compute is (relative to the i9 process) significantly enhanced.
 
Probably should just say "Jim Keller", everyone knows him
The joke is Tenstorrent has neither product nor roadmap, and it all slipped anyway.
99% of models get slimmed down for inference compared to training
Que.
Actually do you need that much BW for inference at all?
Holy shit YES.
600W way more likely, especially with that 28Gbps, oop
Not even that.
RAM is cheap
Are you serious?
DRAM scaling is ~dead~.
Gemini Nano runs on phones already, as does real-time translation.
It's a party trick, not a Copilot alternative.
But we'll also watch phones and laptops double in RAM in the blink of an eye
Who the fuck is gonna pay for all that BOM?
Hype is fun but ML shit is money, and client stuff runs razor thin margins as is.
 

So kopite7kimi is back to claiming 512-bit bus. *shrug*

I still don't think that makes a lot of sense if they're also moving to GDDR7, but I dunno. Maybe if they're doing away with the large L2 as a bandwidth supplement, it could make sense.

AMD cut back on RDNA 3 Infinity Cache in exchange for higher VRAM bandwidth so it’s possible. If these rumors pan out GB202 has 33% more SMs and 75% more bandwidth. Though I wouldn’t be shocked at a 480-bit bus to help with yields. Either way it’s still a big bandwidth bump.

With such a relatively small increase in SMs higher clocks may be part of the package.
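
For reference, the bandwidth math roughly checks out if you assume 28Gbps GDDR7 (the speed grade is an assumption based on the number floating around this thread):

# Rough bandwidth check for the rumored GB202 configs (28 Gbps GDDR7 assumed).
def bandwidth_gbs(bus_bits, gbps_per_pin):
    return bus_bits * gbps_per_pin / 8          # GB/s

ad102 = bandwidth_gbs(384, 21)                  # 4090: ~1008 GB/s
gb202_full = bandwidth_gbs(512, 28)             # rumored 512-bit: 1792 GB/s
gb202_cut = bandwidth_gbs(480, 28)              # possible 480-bit salvage: 1680 GB/s

print(f"512-bit: +{gb202_full / ad102 - 1:.0%} over AD102")   # ~ +78%
print(f"480-bit: +{gb202_cut / ad102 - 1:.0%} over AD102")    # ~ +67%

The 512-bit case at 28Gbps lands right around the quoted +75%.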
 
Some of the work I've been doing is training a Mistral 7B LLM on home automation things, and then distilling / pruning stuff I don't need / won't use for said home automation. I also end up quantizing it down to like six or even four bits, seeing what I can do to get it all wedged into the 12GB of VRAM on my 3080Ti while keeping my target capabilities. I'm still not good at it yet :) Nevertheless, I have done some playing with hardware speeds. If I leave all the clocks at my 3080Ti's max undervolt (1695 core @ 750mV / 16500 memory) I can get upwards of 30 tokens/sec depending on what I've done to the language model. If I reduce memory clocks by 70% (5002MHz) the token rate is only barely affected; maybe it will lose 10%, and sometimes it's not even measurable. However, if I crank the GPU clock down by about 55% (810MHz) the token rate drops by half as well.
Genuinely curious about those core vs memory underclock results. I would have expected a 7B+ LLM with a batch size of 1 to be massively bandwidth-limited for most kernels. The only thing I can think of is maybe the (de-)quantisation step is a lot more expensive than I'm assuming it is? Or this is a case where you are still going through PCIe for some of the data (maybe unintentionally depending on the framework you're using) - do you mean Mixtral 8x7B where most of the data will need to be in CPU DRAM? Either way, for edge like your use case though, AFAIK you can typically get more than high enough tokens/s for any model that fits in VRAM, so performance isn't the main bottleneck, it's memory capacity.

Bondrewd is correct that LLM inference performance is mostly about memory bandwidth AFAIK; the only exception is if you have a very large VRAM pool and you can get away with a *huge* batch size for your use case (i.e. you are not latency sensitive - one example I've been looking a bit into is generating synthetic data in bulk). There's an interesting non-linearity here: increasing batch size will increase performance but latency will get worse (since per-user throughput is performance divided by batch size). So if you compare H100 80GB with H200 141GB where you have much higher baseline performance *and* much higher memory capacity, you can increase batch size for the same latency, which means performance (=> tokens/$) will increase faster than the raw performance increase at iso-latency. The same applies to MI300X and Blackwell obviously.
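
A toy illustration of that non-linearity (the throughput model and numbers below are made-up placeholders, not measured H100/H200 figures):

# Toy batch-size vs per-user-throughput model; all numbers are illustrative.
def total_tokens_per_s(batch, per_user_bw_bound=60, compute_ceiling=3000):
    # small batches: roughly linear scaling (the weights get re-read anyway),
    # large batches: saturate at a compute-bound ceiling
    return min(batch * per_user_bw_bound, compute_ceiling)

for batch in (1, 8, 32, 64, 128):
    total = total_tokens_per_s(batch)
    print(f"batch {batch:3d}: {total:5.0f} tok/s total, {total / batch:5.1f} tok/s per user")
# Total throughput rises with batch size while per-user tok/s (latency) falls,
# so more VRAM -> bigger batch at the same per-user latency -> better tokens/$.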
 
I'm not smart enough in this space to give you good answers, however if memory access goes across the PCIe interface, the token rate tanks to less than 3 tokens/sec. It becomes SORELY obvious when requests spill over to main memory, so I'm confident in saying my results aren't reflective of such.

Again, I think this is why both the M1 Macs and really cheap (like, Pascal-era) video cards with big VRAM pools are doing so well for inference. Neither of those products has big bandwidth numbers (well, compared to some of the most modern stuff); it just so happens the GPU has direct access to a memory pool of sufficient size to hold the entire dataset at incredibly low latencies. I feel it pertinent to point out again: the base-SKU M1 processor has ~20% less memory bandwidth than an i9-13900K running top-end DDR5, yet the M1 absolutely smokes a 13900K in inference speeds. Like, it's not even close.
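
For what it's worth, the bandwidth comparison roughly works out (assuming DDR5-5600 as the i9's "top-end"):

# Rough numbers behind the M1 vs i9-13900K bandwidth comparison (DDR5-5600 assumed).
m1_bw = 4266 * 16 / 1000     # 128-bit LPDDR4X-4266 -> ~68 GB/s
i9_bw = 5600 * 16 / 1000     # dual-channel (128-bit) DDR5-5600 -> ~90 GB/s
print(f"M1 ~{m1_bw:.0f} GB/s vs i9 ~{i9_bw:.0f} GB/s -> ~{1 - m1_bw / i9_bw:.0%} less")
# Roughly 20-25% less bandwidth on the M1, yet it wins on inference speed.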

You could rightfully chalk this up to the far stronger GPU in the M1, but it also points to memory bandwidth being less of a constraint (for inference) than we might think it is. Given my adjacent knowledge in the x86 server space, my only theory here is a latency play rather than a pure bandwidth play. Here's a proxy to put it in different terms: so many people thought the original SSDs were faster than HDDs because of their bandwidth numbers. Speaking in naive terms, PCIe 4 drives are hitting 7500MB/sec, which is ~20x the absolute fastest twin-head Seagate spinners hitting ~400MB/sec. However, if you dig deeper, the absolute speed of flash storage (even dating back to SSDs on SATA) was never really about the 5x, 10x or 20x bandwidth; it was about the hundred-fold and now thousand-fold decrease in latency.

So at least that's where my head is, although I'm simply not good enough at this technology stack to say all of this with certainty. I can only report on what I've been playing with in my home lab, and I'll never try to convince anyone I'm doing it right ;)
 
AMD cut back on RDNA 3 Infinity Cache in exchange for higher VRAM bandwidth so it’s possible. If these rumors pan out GB202 has 33% more SMs and 75% more bandwidth. Though I wouldn’t be shocked at a 480-bit bus to help with yields. Either way it’s still a big bandwidth bump.

With such a relatively small increase in SMs higher clocks may be part of the package.
I can see them assuming a 512-bit bus for GB202 and doing a 448-bit bus for a 28GB 5090, and then maybe a 32GB 5090 Ti. My big wonder is GB203, because 16GB won't cut it for an 80-class type card these days, and same for 12-ish or so for a 70-class. Like, I do wonder if Nvidia has multiple designs they are deciding on because 3GB GDDR7 isn't on time.

Because 3GB modules would've been great for GB20x: a 128-bit bus card could have 12GB, so you could've had like a 400 USD card with that amount and maybe 4070 Ti Super performance with GB206? Amazing for entry-level 1440p.
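
The capacity math behind those configs, for reference (2GB = 16Gb modules, 3GB = 24Gb modules):

# VRAM capacity math for the bus widths discussed above.
def vram_gb(bus_bits, gb_per_module, clamshell=False):
    modules = bus_bits // 32                 # one GDDR module per 32-bit channel
    return modules * gb_per_module * (2 if clamshell else 1)

print(vram_gb(512, 2))                 # 32 GB  (full 512-bit GB202, 2GB modules)
print(vram_gb(448, 2))                 # 28 GB  (448-bit cut-down, e.g. a 5090)
print(vram_gb(256, 2))                 # 16 GB  (256-bit GB203 class)
print(vram_gb(128, 3))                 # 12 GB  (128-bit card with 3GB modules)
print(vram_gb(512, 2, clamshell=True)) # 64 GB  (clamshell, the data center angle above)
print(vram_gb(512, 3, clamshell=True)) # 96 GB  (clamshell with 3GB modules)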
 
16GBs are fine for whatever "class" and will be for quite some time still I'd say.
12 though are less than ideal for anything faster than a 3060.
 
16GBs are fine for whatever "class" and will be for quite some time still I'd say.
12 though are less than ideal for anything faster than a 3060.

Frontiers of Pandora takes >16GB for its "Unobtanium" setting, and of course we'll see more games do this as time goes on. This puts 16GB firmly in midrange-at-most territory; people spending "$799"+ will want to max settings, even if it's only at 30+fps.
 
Frontiers of Pandora takes >16GB for its "Unobtanium" setting
Could you provide any data to back up that claim? Haven't seen the game showing any VRAM issues on any cards on any preset thus far. IIRC they are allocating all VRAM available but that doesn't mean that they need that much to run without issues.

[attached benchmark image]

Also this one is just at Ultra, but since there's no real difference between the 4060 Tis there I wouldn't expect the 16GB one to suddenly start showing issues on Unobtanium.

Truth is, 16GBs is what current gen consoles have (total VRAM yada yada) and thus it will very likely remain the "sweet spot" until we switch console h/w generations again.
The exceptions here would be titles like CP2077, which are very rare and thus won't affect the overall picture too much.
For those who think that 16GBs aren't enough, well, there are products with more VRAM. Prepare to pay a lot.
 
Nvidia won’t have any trouble selling a 16GB 5080. There’s also a chance that the 5080 is a cut down GB202.
 
16GBs are fine for whatever "class" and will be for quite some time still I'd say.
12 though are less than ideal for anything faster than a 3060.
I'd go further in saying that not only is 16GB fine for basically anything, but 12GB is fine for most 'high end' gaming experiences as well if you're not so averse to turning down a setting or two or using DLSS.
We'll have to see how much devs lean on 'direct streaming' going forward and how well supported DirectStorage on PC becomes. I think these will be critical in the RAM discussions.
 
I'd go further in saying that not only is 16GB fine for basically anything, but 12GB is fine for most 'high end' gaming experiences as well if you're not so averse to turning down a setting or two or using DLSS.
12GBs are borderline. They are okay now for console-level settings, but there will likely be an expansion of PC-exclusive features again which will eat into VRAM on top of your typical console requirements. I basically expect 12GBs during the next GPU gen to fare similarly to how 8GBs did during this gen.
 
Nvidia won’t have any trouble selling a 16GB 5080. There’s also a chance that the 5080 is a cut down GB202.
The days of paying 35% less for an xx80 GPU with just 10% lower performance than the 90-class model, while still getting the top die, are over. Blackwell will make the performance gap between the xx80 and xx90 class even bigger :(

PS: don't forget that Blackwell was designed when NVDA expected to be competing with a 3nm MCM RDNA4 monster. So the performance increase over Ada shouldn't be any lower than Ada's was over Ampere...
 