Speculation and Rumors: Nvidia Blackwell ...

I still don't think that makes a lot of sense if they're also moving to GDDR7, but I dunno. Maybe if they're doing away with the large L2 as a bandwidth supplement, it could make sense.
The answer is unironically data center.
You get 64/96GB of VRAM on the upper end clamshells that way (with 16/24Gb ICs that is).
 
The answer is unironically data center.
You get 64/96GB of VRAM on the upper end clamshells that way (with 16/24Gb ICs that is).

I can see it, Tenstorrent is already doing "cheaper" AI accelerators as a strategy. If you're doing cloud inference you need cheap first and foremost, and don't need 192GB of HBM like in training.

800W Blackwell incoming; would hardly be surprised at the same strategy from AMD as well. I'm sure some youtuber is going to buy one of these at $5K or so and show it running games for a gag, but at that point it's probably stretching "consumer" beyond credibility even for PR purposes.

I do wonder how long we'll see cloud inference being a "major" thing though. AI model makers are already starting to understand that cloud inferencing isn't very profitable, especially when consumers will willingly pay for edge devices that can run models themselves.
 
Nvidia already has a "cheap" datacenter accelerator with the PCIe L40S based on AD102. It goes for about $12K. Makes sense for the Blackwell successor to bump VRAM capacity for that market.
 
Tenstorrent is already doing "cheaper" AI accelerators as a strategy
who.
and don't need 192GB of HBM like in training.
You do.
Large VRAM is primarily for inference.
That's why MS is vacuuming ~most of MI300X supply for Copilot.
800W Blackwell incoming; would hardly be surprised at the same strategy from AMD as well.
a) it's not that powerhungry
b) yea lol Navi50 will be sold in that market too.
I do wonder how long we'll see cloud inference being a "major" thing though.
not going anywhere unless the sorta useful stuff people gonna pay money for (office copilot) evaporates.
especially when consumers will willingly pay for edge devices that can run models themselves.
Client devices have tiny DRAM amounts and nonexistent membw; they're doomed to never do anything useful wrt machine learning.
Makes sense for the Blackwell successor to bump VRAM capacity for that market.
and bandwidth!
 
Probably should just say "Jim Keller", everyone knows him

You do.

Large VRAM is primarily for inference.
99% of models get slimmed down for inference compared to training. True, you need less compute per MB for inference than for training, but that's still a point in GDDR's favor for inference. E.g. 96GB can hold current Mixtral (at 16-bit no less): less compute, less BW, less HBM. Actually, do you need that much BW for inference at all? Maybe DDR with the right models would be good enough, and that's like a fourth the cost of GDDR.
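
As a rough sanity check on that (the parameter counts below are the commonly cited Mixtral 8x7B figures, and the bandwidth values are just illustrative assumptions):

# Back-of-envelope for Mixtral 8x7B inference at batch size 1, fp16.
# Parameter counts and bandwidth figures are assumptions for illustration.
total_params = 46.7e9       # all experts have to sit in memory
active_params = 12.9e9      # only ~2 of 8 experts are read per generated token
bytes_per_param = 2         # fp16

weights_gb = total_params * bytes_per_param / 1e9     # ~93 GB -> fits in 96 GB
bytes_per_token = active_params * bytes_per_param     # ~26 GB read per token

for name, bw in [("~1 TB/s GDDR", 1000e9), ("~90 GB/s dual-channel DDR5", 90e9)]:
    print(f"{name}: ~{bw / bytes_per_token:.1f} tokens/s ceiling at batch 1")
# GDDR-class bandwidth gives a ceiling of ~38 tok/s here, DDR-class ~3.5 tok/s.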

a) it's not that powerhungry
b) yea lol Navi50 will be sold in that market too.
Yeah, I double-checked the math after posting; 600W way more likely, especially with that 28Gbps, oop

not going anywhere unless the sorta useful stuff people gonna pay money for (office copilot) evaporates.

Client devices have tiny DRAM amounts and nonexistent membw; they're doomed to never do anything useful wrt machine learning.
RAM is cheap and mini models are ramping up; Gemini Nano runs on phones already, as does real-time translation. "Networked this/that/everything" has been a dream of engineers for decades now. But if you can run it on edge, then edge ends up overtaking it for consumers. Cloud inference will always have customers. But we'll also watch phones and laptops double in RAM in the blink of an eye, and more inference will be run there than in the cloud, at least in terms of number of users.

Once "OpenOffice Brain" or whatever that can run in 10gb of ram comes out as "good enough" that'll be a lot more customers than Office365.
 
So, I have some experience now with hosting LLMs at home. Training, distilling, and quantizing will all consume a pretty serious chunk of memory bandwidth. As for inference though, it's far more about compute and low latency / direct access to memory. CPU inference is slow mostly due to the CPU itself; GPU inference backed by main memory (i.e. not VRAM) on an x86 platform gets ugly because the latency from GPU to CPU (memory) is atrocious.

Some of the work I've been doing is training a Mistral 7B LLM on home automation things, and then distilling / pruning stuff I don't need / won't use for said home automation. I also end up quantizing it down to like six or even four bits, seeing what I can do to get it all wedged into the 12GB of VRAM on my 3080Ti while keeping my target capabilities. I'm still not good at it yet :) Nevertheless, I have done some playing with hardware speeds. If I leave all the clocks at my 3080Ti's max undervolt (1695 core @ 750mV / 16500 memory) I can get upwards of 30 tokens/sec depending on what I've done to the language model. If I reduce memory clocks by 70% (5002MHz) the token rate is only barely affected; maybe it will lose 10%, and sometimes it's not even measurable. However, if I crank the GPU clock down by about 55% (810MHz) the token rate drops by half as well.
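
For reference, a minimal tokens/sec timing sketch looks something like this (llama-cpp-python here is an assumption, and the model filename and prompt are just placeholders, not necessarily what I'm actually running):

# Minimal tokens/sec measurement sketch (llama-cpp-python assumed).
import time
from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-home-q4.gguf",  # hypothetical 4-bit quantized model
            n_gpu_layers=-1,                       # offload all layers to the GPU
            n_ctx=2048, verbose=False)

start = time.perf_counter()
out = llm("Turn off the living room lights and arm the alarm.", max_tokens=256)
elapsed = time.perf_counter() - start

tokens = out["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tokens/s")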

"Slow" commodity GPUs with big memory pools have become quite sought after for exactly this reason; they're quickly being gobbled up by LLM enthusiasts and driving the prices up. Further to this point, this is also why the original Apple M1's (not even the 'big' SKUs) make great inference devices if you get them with 32GB of ram. Apple silicon does not use separate memory pools for CPU vs GPU, so main memory is shared and accessible by both. The Apple M1's (standard sku) memory bandwidth is less than a well equipped i9-13900k can support, however the GPU compute in the M1 combined with the singular global pool of memory means the latency kept very low and the compute is (relative to the i9 process) significantly enhanced.
 
Probably should just say "Jim Keller", everyone knows him
The joke is Tenstorrent has neither product nor roadmap, and it all slipped anyway.
99% of models get slimmed down for inference compared to training
Que.
Actually do you need that much BW for inference at all?
Holy shit YES.
600W way more likely, especially with that 28Gbps, oop
Not even that.
RAM is cheap
Are you serious?
DRAM scaling is ~dead~.
Gemini Nano runs on phones already, as does real-time translation.
It's a party trick, not a Copilot alternative.
But we'll also watch phones and laptops double in RAM in the blink of an eye
Who the fuck is gonna pay for all that BOM?
Hype is fun but ML shit is money, and client stuff runs razor thin margins as is.
 

So kopite7kimi is back to claiming 512-bit bus. *shrug*

I still don't think that makes a lot of sense if they're also moving to GDDR7, but I dunno. Maybe if they're doing away with the large L2 as a bandwidth supplement, it could make sense.

AMD cut back on RDNA 3 Infinity Cache in exchange for higher VRAM bandwidth so it’s possible. If these rumors pan out GB202 has 33% more SMs and 75% more bandwidth. Though I wouldn’t be shocked at a 480-bit bus to help with yields. Either way it’s still a big bandwidth bump.

With such a relatively small increase in SMs higher clocks may be part of the package.
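
For reference, the bandwidth math roughly checks out if you assume 28Gbps GDDR7 (the speed grade is an assumption based on the number floating around this thread):

# Rough bandwidth check for the rumored GB202 configs (28 Gbps GDDR7 assumed).
def bandwidth_gbs(bus_bits, gbps_per_pin):
    return bus_bits * gbps_per_pin / 8          # GB/s

ad102 = bandwidth_gbs(384, 21)                  # 4090: ~1008 GB/s
gb202_full = bandwidth_gbs(512, 28)             # rumored 512-bit: 1792 GB/s
gb202_cut = bandwidth_gbs(480, 28)              # possible 480-bit salvage: 1680 GB/s

print(f"512-bit: +{gb202_full / ad102 - 1:.0%} over AD102")   # ~ +78%
print(f"480-bit: +{gb202_cut / ad102 - 1:.0%} over AD102")    # ~ +67%

The 512-bit case at 28Gbps lands right around the quoted +75%.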
 
Some of the work I've been doing is training a Mistral 7B LLM on home automation things, and then distilling / pruning stuff I don't need / won't use for said home automation. I also end up quantizing it down to like six or even four bits, seeing what I can do to get it all wedged into the 12GB of VRAM on my 3080Ti while keeping my target capabilities. I'm still not good at it yet :) Nevertheless, I have done some playing with hardware speeds. If I leave all the clocks at my 3080Ti's max undervolt (1695 core @ 750mV / 16500 memory) I can get upwards of 30 tokens/sec depending on what I've done to the language model. If I reduce memory clocks by 70% (5002MHz) the token rate is only barely affected; maybe it will lose 10%, and sometimes it's not even measurable. However, if I crank the GPU clock down by about 55% (810MHz) the token rate drops by half as well.
Genuinely curious about those core vs memory underclock results. I would have expected a 7B+ LLM with a batch size of 1 to be massively bandwidth-limited for most kernels. The only thing I can think of is maybe the (de-)quantisation step is a lot more expensive than I'm assuming it is? Or this is a case where you are still going through PCIe for some of the data (maybe unintentionally depending on the framework you're using) - do you mean Mixtral 8x7B where most of the data will need to be in CPU DRAM? Either way, for edge like your use case though, AFAIK you can typically get more than high enough tokens/s for any model that fits in VRAM, so performance isn't the main bottleneck, it's memory capacity.

Bondrewd is correct that LLM inference performance is mostly about memory bandwidth AFAIK; the only exception is if you have a very large VRAM pool and you can get away with a *huge* batch size for your use case (i.e. you are not latency sensitive - one example I've been looking a bit into is generating synthetic data in bulk). There's an interesting non-linearity here: increasing batch size will increase performance but latency will get worse (since per-user throughput is performance divided by batch size). So if you compare H100 80GB with H200 141GB where you have much higher baseline performance *and* much higher memory capacity, you can increase batch size for the same latency, which means performance (=> tokens/$) will increase faster than the raw performance increase at iso-latency. The same applies to MI300X and Blackwell obviously.
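
A toy illustration of that non-linearity (the throughput model and numbers below are made-up placeholders, not measured H100/H200 figures):

# Toy batch-size vs per-user-throughput model; all numbers are illustrative.
def total_tokens_per_s(batch, per_user_bw_bound=60, compute_ceiling=3000):
    # small batches: roughly linear scaling (the weights get re-read anyway),
    # large batches: saturate at a compute-bound ceiling
    return min(batch * per_user_bw_bound, compute_ceiling)

for batch in (1, 8, 32, 64, 128):
    total = total_tokens_per_s(batch)
    print(f"batch {batch:3d}: {total:5.0f} tok/s total, {total / batch:5.1f} tok/s per user")
# Total throughput rises with batch size while per-user tok/s (latency) falls,
# so more VRAM -> bigger batch at the same per-user latency -> better tokens/$.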
 
I'm not smart enough in this space to give you good answers, however if memory access goes across the PCIe interface, the token rate tanks to less than 3 tokens/sec. It becomes SORELY obvious when requests spill over to main memory, so I'm confident in saying my results aren't reflective of such.

Again, I think this is why both the M1 Macs and really cheap (like, Pascal-era) video cards with big VRAM pools are doing so well for inference. Neither of those products has big bandwidth numbers (well, compared to some of the most modern stuff); it just so happens the GPU has direct access to a memory pool of sufficient size to hold the entire dataset at incredibly low latencies. I feel it pertinent to point out again: the base-SKU M1 processor has ~20% less memory bandwidth than an i9-13900K running top-end DDR5, yet the M1 absolutely smokes a 13900K in inference speeds. Like, it's not even close.
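
For what it's worth, the bandwidth comparison roughly works out (assuming DDR5-5600 as the i9's "top-end"):

# Rough numbers behind the M1 vs i9-13900K bandwidth comparison (DDR5-5600 assumed).
m1_bw = 4266 * 16 / 1000     # 128-bit LPDDR4X-4266 -> ~68 GB/s
i9_bw = 5600 * 16 / 1000     # dual-channel (128-bit) DDR5-5600 -> ~90 GB/s
print(f"M1 ~{m1_bw:.0f} GB/s vs i9 ~{i9_bw:.0f} GB/s -> ~{1 - m1_bw / i9_bw:.0%} less")
# Roughly 20-25% less bandwidth on the M1, yet it wins on inference speed.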

You could rightfully chalk this up to the far stronger GPU in the M1, but it also points to memory bandwidth being less of a constraint (for inference) than we might think it is. Given my adjacent knowledge in the x86 server space, my only theory here is a latency play rather than a pure bandwidth play. Here's a proxy to put it in different terms: so many people thought the original SSDs were faster than HDDs because of their bandwidth numbers. Speaking in naive terms, PCIe 4 drives are hitting 7500MB/sec, which is ~20x the absolute fastest twin-head Seagate spinners hitting ~400MB/sec. However, if you dig deeper, the absolute speed of flash storage (even dating back to SSDs on SATA) was never really about the 5x, 10x or 20x bandwidth; it was about the hundred-fold and now thousand-fold decrease in latency.

So at least that's where my head is, although I'm simply not good enough at this technology stack to say all of this with certainty. I can only report on what I've been playing with in my home lab, and I'll never try to convince anyone I'm doing it right ;)
 
AMD cut back on RDNA 3 Infinity Cache in exchange for higher VRAM bandwidth so it’s possible. If these rumors pan out GB202 has 33% more SMs and 75% more bandwidth. Though I wouldn’t be shocked at a 480-bit bus to help with yields. Either way it’s still a big bandwidth bump.

With such a relatively small increase in SMs higher clocks may be part of the package.
I can see them assuming a 512-bit bus for GB202 and doing a 448-bit bus for a 28GB 5090, and then maybe a 32GB 5090 Ti. My big wonder is GB203, because 16GB won't cut it for an 80-class type card these days, and same for 12-ish or so for a 70-class. Like, I do wonder if Nvidia has multiple designs they are deciding on because 3GB GDDR7 isn't on time.

Because 3GB modules would've been great for GB20x: a 128-bit bus card could have 12GB, so you could've had like a 400 USD card with that amount and maybe 4070 Ti Super performance with GB206? Amazing for entry-level 1440p.
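
The capacity math behind those configs, for reference (2GB = 16Gb modules, 3GB = 24Gb modules):

# VRAM capacity math for the bus widths discussed above.
def vram_gb(bus_bits, gb_per_module, clamshell=False):
    modules = bus_bits // 32                 # one GDDR module per 32-bit channel
    return modules * gb_per_module * (2 if clamshell else 1)

print(vram_gb(512, 2))                 # 32 GB  (full 512-bit GB202, 2GB modules)
print(vram_gb(448, 2))                 # 28 GB  (448-bit cut-down, e.g. a 5090)
print(vram_gb(256, 2))                 # 16 GB  (256-bit GB203 class)
print(vram_gb(128, 3))                 # 12 GB  (128-bit card with 3GB modules)
print(vram_gb(512, 2, clamshell=True)) # 64 GB  (clamshell, the data center angle above)
print(vram_gb(512, 3, clamshell=True)) # 96 GB  (clamshell with 3GB modules)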
 
16GBs are fine for whatever "class" and will be for quite some time still I'd say.
12 though are less than ideal for anything faster than a 3060.
 
16GBs are fine for whatever "class" and will be for quite some time still I'd say.
12 though are less than ideal for anything faster than a 3060.

Frontiers of Pandora takes >16GB for its "Unobtanium" setting, and of course we'll see more games do this as time goes on. This puts 16GB firmly in midrange-at-most territory; people spending "$799"+ will want to max settings, even if it's only at 30+fps.
 
Frontiers of Pandora takes >16GB for its "Unobtanium" setting
Could you provide any data to back up that claim? Haven't seen the game showing any VRAM issues on any cards on any preset thus far. IIRC they are allocating all VRAM available but that doesn't mean that they need that much to run without issues.

[attached benchmark image]

Also this one is just at Ultra, but since there's no real difference between the 4060 Tis there I wouldn't expect the 16GB one to suddenly start showing issues on Unobtanium.

Truth is, 16GBs is what current gen consoles have (total VRAM yada yada) and thus it will very likely remain the "sweet spot" until we switch console h/w generations again.
The exceptions here would be titles like CP2077, which are very rare and thus won't affect the overall picture too much.
For those who think that 16GBs aren't enough, well, there are products with more VRAM. Prepare to pay a lot.
 
Nvidia won’t have any trouble selling a 16GB 5080. There’s also a chance that the 5080 is a cut down GB202.
 
16GBs are fine for whatever "class" and will be for quite some time still I'd say.
12 though are less than ideal for anything faster than a 3060.
I'd go further in saying that not only is 16GB fine for basically anything, but 12GB is fine for most 'high end' gaming experiences as well if you're not so averse to turning down a setting or two or using DLSS.
We'll have to see how much devs lean on 'direct streaming' going forward and how well supported DirectStorage on PC becomes. I think these will be critical in the RAM discussions.
 
I'd go further in saying that not only is 16GB fine for basically anything, but 12GB is fine for most 'high end' gaming experiences as well if you're not so averse to turning down a setting or two or using DLSS.
12GBs are borderline. They are okay now for console-level settings, but there will likely be an expansion of PC-exclusive features again which will eat into VRAM on top of your typical console requirements. I basically expect 12GBs during the next GPU gen to fare similarly to how 8GBs did during this gen.
 
Nvidia won’t have any trouble selling a 16GB 5080. There’s also a chance that the 5080 is a cut down GB202.
The days of paying 35% less for an xx80 GPU with just 10% lower performance than the 90-class model, while still getting the top die, are over. Blackwell will make the performance gap between the xx80 and xx90 class even bigger :(

PS: don't forget that Blackwell was designed when NVDA expected to be competing with a 3nm MCM RDNA4 monster. So the performance increase over Ada shouldn't be any lower than Ada's was over Ampere...
 