So kopite7kimi is back to claiming 512-bit bus. *shrug*
I still don't think that makes a lot of sense if they're also moving to GDDR7, but I dunno. Maybe if they're doing away with the large L2 as a bandwidth supplement, it could make sense.
> I still don't think that makes a lot of sense if they're also moving to GDDR7, but I dunno. Maybe if they're doing away with the large L2 as a bandwidth supplement, it could make sense.
The answer is unironically data center. You get 64/96GB of VRAM on the upper-end clamshells that way (with 16/24Gb ICs, that is).
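Quick napkin math on where 64/96GB comes from; a minimal sketch assuming 32-bit channels and two devices per channel in clamshell mode (the bus width and IC densities are the rumored figures, not confirmed specs):

```python
# Napkin math: clamshell VRAM capacity on a rumored 512-bit GDDR7 bus.
bus_width_bits = 512
channels = bus_width_bits // 32          # 16 devices in normal mode (32-bit each)
clamshell_devices = channels * 2         # 32 devices with two per channel

for density_gbit in (16, 24):            # 16Gb and 24Gb ICs from the rumor
    capacity_gbyte = clamshell_devices * density_gbit / 8
    print(f"{density_gbit}Gb ICs -> {capacity_gbyte:.0f}GB total")
# 16Gb ICs -> 64GB total
# 24Gb ICs -> 96GB total
```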
> Tenstorrent is already doing "cheaper" AI accelerators as a strategy
who.
> and don't need 192GB of HBM like in training.
You do.
> 800W Blackwell incoming; would hardly be surprised at the same strategy from AMD as well.
a) it's not that power hungry
> I do wonder how long we'll see cloud inference being a "major" thing though.
Not going anywhere unless the sorta-useful stuff people are gonna pay money for (Office Copilot) evaporates.
> especially when consumers will willingly pay for edge devices that can run models themselves.
Client devices have tiny DRAM amounts and nonexistent memory bandwidth; they're doomed to never do anything useful wrt machine learning.
> Makes sense for the Blackwell successor to bump VRAM capacity for that market.
And bandwidth!
> who.
Probably should just say "Jim Keller", everyone knows him.
> You do. Large VRAM is primarily for inference.
99% of models get slimmed down for inference compared to training. True, you need less compute per MB for inference than for training, but that's still a point in GDDR's favor for inference, i.e. 96GB can hold current Mixtral (at 16-bit, no less). Less compute, less BW, less HBM. Actually, do you need that much BW for inference at all? Maybe DDR with the right models would be good enough; that's like a fourth the cost of GDDR.
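Rough sizing check on the "96GB can hold Mixtral at 16-bit" point; a minimal sketch assuming roughly 47B total parameters for Mixtral 8x7B (the exact count and the omitted KV-cache overhead are assumptions here):

```python
# Weight footprint only; KV cache and activations come on top of this.
params_billion = 46.7                    # assumed total parameter count for Mixtral 8x7B
for bits in (16, 8, 4):
    gbytes = params_billion * bits / 8   # GB of weights at the given precision
    print(f"{bits}-bit weights: ~{gbytes:.0f}GB")
# 16-bit weights: ~93GB  -> fits in 96GB, barely
# 8-bit weights:  ~47GB
# 4-bit weights:  ~23GB
```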
> a) it's not that power hungry
> b) yea lol Navi50 will be sold in that market too.
Yeah, I double-checked the math after posting; 600W way more likely, especially with that 28Gbps, oop.
> Not going anywhere unless the sorta-useful stuff people are gonna pay money for (Office Copilot) evaporates.
> Client devices have tiny DRAM amounts and nonexistent memory bandwidth; they're doomed to never do anything useful wrt machine learning.
RAM is cheap and mini models are ramping up: Gemini Nano runs on phones already, as does realtime translation. "Networked this/that/everything" has been a dream of engineers for decades now, but if you can run it on edge, then edge ends up overtaking it for consumers. Cloud inference will always have customers, but we'll also watch phones and laptops double in RAM in the blink of an eye, and more inference will be run there than in the cloud, at least in terms of number of users.
> Probably should just say "Jim Keller", everyone knows him.
The joke is Tenstorrent has neither product nor roadmap, and it all slipped anyway.
> 99% of models get slimmed down for inference compared to training
Que.
> Actually, do you need that much BW for inference at all?
Holy shit YES.
> 600W way more likely, especially with that 28Gbps, oop
Not even that.
> RAM is cheap
Are you serious?
> Gemini Nano runs on phones already, as does realtime translation.
It's a party trick, not a Copilot alternative.
> But we'll also watch phones and laptops double in RAM in the blink of an eye
Who the fuck is gonna pay for all that BOM?
> Some of the work I've been doing is training a Mistral 7B LLM on home automation things, and then distilling / pruning stuff I don't need / won't use for said home automation. I also end up quantizing it down to like six or even four bits, seeing what I can do to get it all wedged into the 12GB of VRAM on my 3080 Ti while keeping my target capabilities. I'm still not good at it yet. Nevertheless, I have done some playing with hardware speeds. If I leave all the clocks at my 3080 Ti's max undervolt (1695MHz core @ 750mV / 16500MHz memory) I can get upwards of 30 tokens/sec depending on what I've done to the language model. If I reduce memory clocks by 70% (to 5002MHz) the token rate is only barely affected; maybe it loses 10%, and sometimes it's not even measurable. However, if I crank the GPU clock down by about 55% (to 810MHz), the token rate drops by half as well.

Genuinely curious about those core vs memory underclock results. I would have expected a 7B+ LLM at batch size 1 to be massively bandwidth limited for most kernels. The only things I can think of are that the (de-)quantisation step is a lot more expensive than I'm assuming, or that you're still going through PCIe for some of the data (maybe unintentionally, depending on the framework you're using). Or do you mean Mixtral 8x7B, where most of the data would need to sit in CPU DRAM? Either way, for an edge use case like yours, AFAIK you can typically get more than enough tokens/s from any model that fits in VRAM, so performance isn't the main bottleneck; it's memory capacity.
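For what it's worth, here's the crude decode-throughput ceiling I had in mind; a minimal sketch assuming batch-1 generation streams roughly the whole weight footprint per token (the bandwidth and model-size numbers below are illustrative assumptions, not measurements from that 3080 Ti):

```python
# Batch-1 decode roofline: tokens/s can't exceed memory bandwidth / weight bytes.
def decode_ceiling_tok_s(mem_bw_gb_s: float, params_b: float, bits_per_weight: int) -> float:
    model_gb = params_b * bits_per_weight / 8    # weight footprint in GB
    return mem_bw_gb_s / model_gb

# Assumed ~912 GB/s at stock vs ~276 GB/s after a ~70% memory underclock.
for bw in (912, 276):
    print(f"{bw} GB/s, 7B @ 4-bit: ~{decode_ceiling_tok_s(bw, 7, 4):.0f} tok/s ceiling")
# Both ceilings sit far above ~30 tok/s, which is consistent with the observation
# that cutting memory clocks barely moves the measured rate: the kernels look
# compute- or latency-bound (e.g. dequantisation), not bandwidth-bound.
```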
> AMD cut back on RDNA 3 Infinity Cache in exchange for higher VRAM bandwidth, so it's possible. If these rumors pan out, GB202 has 33% more SMs and 75% more bandwidth. Though I wouldn't be shocked at a 480-bit bus to help with yields; either way it's still a big bandwidth bump. With such a relatively small increase in SMs, higher clocks may be part of the package.

I can see assuming a 512-bit bus for GB202, doing a 448-bit bus for a 28GB 5090, and then maybe a 32GB 5090 Ti. My big wonder is GB203, because 16GB won't cut it for an 80-class card these days, and the same goes for 12GB or so for a 70-class. I do wonder if Nvidia has multiple designs they're deciding between because GDDR7 3GB ICs aren't on time.
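The bandwidth figure is easy to sanity-check; a quick sketch assuming the rumored 512-bit bus at 28Gbps GDDR7 against the 4090's known 384-bit, 21Gbps GDDR6X:

```python
# Bandwidth comparison: rumored GB202 config vs the RTX 4090 (AD102).
def bandwidth_gb_s(bus_bits: int, gbps_per_pin: float) -> float:
    return bus_bits / 8 * gbps_per_pin   # GB/s

gb202 = bandwidth_gb_s(512, 28)          # rumored: 1792 GB/s
ad102 = bandwidth_gb_s(384, 21)          # RTX 4090: 1008 GB/s
print(f"GB202 rumor: {gb202:.0f} GB/s, 4090: {ad102:.0f} GB/s, "
      f"+{(gb202 / ad102 - 1) * 100:.0f}%")
# -> GB202 rumor: 1792 GB/s, 4090: 1008 GB/s, +78%
```

That lands at roughly the +75% quoted above; the exact percentage depends on which memory speeds you assume.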
16GB is fine for whatever "class" and will be for quite some time still, I'd say.
12GB though is less than ideal for anything faster than a 3060.
> Frontiers of Pandora takes >16GB for its "Unobtanium" setting.
Could you provide any data to back up that claim? I haven't seen the game show VRAM issues on any card at any preset so far. IIRC it allocates all available VRAM, but that doesn't mean it needs that much to run without issues.
> 16GB is fine for whatever "class" and will be for quite some time still, I'd say. 12GB though is less than ideal for anything faster than a 3060.
I'd go further and say that not only is 16GB fine for basically anything, but 12GB is fine for most "high end" gaming experiences as well, if you're not averse to turning down a setting or two or using DLSS.
> I'd go further and say that not only is 16GB fine for basically anything, but 12GB is fine for most "high end" gaming experiences as well, if you're not averse to turning down a setting or two or using DLSS.
12GB is borderline. It's okay now for console-level settings, but there will likely be another expansion of PC-exclusive features, which will eat into VRAM on top of the typical console requirements. I basically expect 12GB to fare during the next GPU generation about how 8GB fared during this one.
> Nvidia won't have any trouble selling a 16GB 5080. There's also a chance that the 5080 is a cut-down GB202.
The days of paying 35% less for an xx80 GPU that was only 10% slower than the 90-class model and used the top die are over. Blackwell will make the performance gap between the xx80 and xx90 classes even bigger.