Nvidia Blackwell Architecture Speculation

  • Thread starter Deleted member 2197
150mm² is low end. It just is. There is no process so advanced that it changes this. Also just a 128-bit bus. That's again, a low end spec that you use on low end parts.
So the GeForce4 (NV25) at 128mm² and the Radeon 8500 (R200) at 120mm² (both on 150nm) were low-end then, despite being unquestionably the fastest cards of their day? And a >$20K 2nm wafer price doesn't make any difference compared to ~$4K for 28nm in the Maxwell timeframe? (iirc, exact number could be slightly wrong)

This is literally just arguing semantics and like Albuquerque I don't get why everyone is so pointlessly obsessed about it. It's perfectly fair to point out that NVIDIA's gross margins for consumer GPUs are significantly higher this generation than in the past (it's hard to tell how much though given their revenue is now dominated by datacenter) and that this is problematic for the long-term health of the PC gaming market. You don't need to debate what's low-end or high-end to agree on that.

Hopefully AMD & Intel get more competitive, and hopefully chiplets allow for more efficient 2.5D/3D designs despite transistor costs not scaling as fast anymore. That might take a while though, and I really don't want to keep arguing this for the next few years until then...
 
we don’t have good info on what a given die costs anyway
Exactly. There's also a whole side missing from this discussion: die cost depends on how many dies from a wafer are actually sold. If a 150 mm² die is the only one sold from a wafer, then its cost is the same as that of a single 600 mm² die from the same wafer. It doesn't have to be a result of physical defects either: a 150 mm² die binned for some SKU, for example one with ultra-high clocks, could end up being the only one from a wafer. The whole idea that die size means anything about how "low end" a product is is just completely wrong.
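To put rough numbers on that argument, here is a minimal Python sketch. It assumes a 300 mm wafer, a commonly used first-order dies-per-wafer approximation, and the ~$20K wafer price mentioned earlier in the thread purely as a placeholder. The function names and the sold-die counts are hypothetical, not real yield or binning data.

```python
import math

def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """First-order estimate: wafer area / die area, minus an edge-loss term.

    Ignores scribe lines, defects and die aspect ratio.
    """
    radius = wafer_diameter_mm / 2
    area_term = math.pi * radius ** 2 / die_area_mm2
    edge_term = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(area_term - edge_term)

def cost_per_sold_die(wafer_cost_usd, sold_dies):
    """The whole wafer cost is carried by the dies that actually ship."""
    return wafer_cost_usd / sold_dies

WAFER_COST_USD = 20_000  # assumption, taken from the figure quoted in the thread

small = gross_dies_per_wafer(150)  # candidates for a 150 mm^2 die
large = gross_dies_per_wafer(600)  # candidates for a 600 mm^2 die

# If every candidate die is sold, the small die is far cheaper per unit.
print(cost_per_sold_die(WAFER_COST_USD, small))
print(cost_per_sold_die(WAFER_COST_USD, large))

# If only one die per wafer qualifies for a SKU (the extreme case from the
# post), the cost is the full wafer regardless of die area.
print(cost_per_sold_die(WAFER_COST_USD, 1))
```

The extreme case is deliberately unrealistic. It only shows that cost per die is driven by how many dies ship, not by area alone.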
 
Micron claims GDDR7 will deliver 50% more raytracing performance than best-in-class GDDR6X. Very vague but if they’re referring to Blackwell vs 4090 that would be a nice bump.

 
What's the GDDR6 platform then?

We can try to guess. On the slides the GDDR6X platform is 2x the performance of the GDDR6 platform at RT while being 20% faster at raster. That lines up well with the 4090 and 7900XTX if you pick something like CP2077 for the RT workload.

The GDDR6 platform is likely from AMD. Nvidia’s fastest GDDR6 card is the 4060 Ti.
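Chaining the two ratios from the slides gives a rough idea of where the GDDR7 platform would sit relative to the GDDR6 one. This is only arithmetic on the quoted claims, normalised to the GDDR6 platform. The variable names are made up for illustration, and nothing here identifies which cards Micron actually used.

```python
# Relative ray tracing performance, normalised to the GDDR6 platform.
GDDR6_RT = 1.0

# Slide reading from the thread: the GDDR6X platform is 2x the GDDR6 platform at RT.
GDDR6X_OVER_GDDR6 = 2.0

# Micron's claim: GDDR7 delivers 50% more RT performance than GDDR6X.
GDDR7_OVER_GDDR6X = 1.5

gddr6x_rt = GDDR6_RT * GDDR6X_OVER_GDDR6
gddr7_rt = gddr6x_rt * GDDR7_OVER_GDDR6X

print(f"GDDR6X platform: {gddr6x_rt:.1f}x the GDDR6 platform at RT")
print(f"GDDR7 platform:  {gddr7_rt:.1f}x the GDDR6 platform at RT")
```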
 
So the GeForce4 (NV25) at 128mm² and the Radeon 8500 (R200) at 120mm² (both on 150nm) were low-end then, despite being unquestionably the fastest cards of their day?
They absolutely would have been low end if much larger GPUs had existed on the same architecture, as they do now.

150mm² is low end today, and has been for quite a while. I feel like trying to find some wormy argument to slither out of acknowledging such a basic concept is bizarre. We're not talking about some magical super advanced process beyond anything else.
 
Assuming those are the Blackwell die configurations, what is the prevailing thought as to primary driver of (presumed) performance enhancements? To my layman's eyes, it would seem the main contenders are, in no particular order:
  1. Significantly increased clock speeds;
  2. Going wider, i.e., each TPC contains more than the 2 SMs of prior architectures, so there are more compute units without having more, or significantly more, GPCs;
  3. Internal reconfigurations, such as cache improvements or enlargements or scheduling enhancements, to improve perf/mm² or the "IPC" of the existing units;
  4. Memory bandwidth boost from GDDR7 and/or the larger bus (at least for GB202);
  5. Specialized hardware additions or improvements (e.g., BVH building fixed function hardware, expanding frame generation to additional frames, hardware to push GPU-driven work generation, etc.).
It has always been fascinating to me to see how different architectures have evolved to achieve greater performance or lower power. Like, Maxwell reconfigured the SM and kept data more local, Pascal ramped the clocks, Turing employed specialized hardware, etc.

If the GB dies aren't on a smaller node than Ada, then it'd probably be a combination of 3, 4, and 5. If there is a die shrink, it opens up 1 and 2. I don't know which way I'm leaning, and I'm just talking out of my nethers, but it's fun to speculate.
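Options 1 and 2 in the list above are the ones that show up directly in peak math throughput, which can be sketched as SM count times FP32 lanes per SM times 2 FLOPs (one fused multiply-add) times clock. The baseline below uses the publicly listed RTX 4090 figures (128 SMs, 128 FP32 lanes per SM, 2.52 GHz boost). The "higher clock" and "wider" configurations are hypothetical numbers for illustration, not leaked Blackwell specs.

```python
def peak_fp32_tflops(sm_count, fp32_lanes_per_sm, clock_ghz):
    """Peak FP32 throughput, counting one FMA (2 FLOPs) per lane per clock."""
    return sm_count * fp32_lanes_per_sm * 2 * clock_ghz / 1000.0

# Baseline: RTX 4090 as publicly specified.
baseline = peak_fp32_tflops(128, 128, 2.52)

# Option 1 (hypothetical): same width, higher clock.
higher_clock = peak_fp32_tflops(128, 128, 3.0)

# Option 2 (hypothetical): same clock, more SMs.
wider = peak_fp32_tflops(160, 128, 2.52)

for name, value in [("baseline", baseline),
                    ("higher clock", higher_clock),
                    ("wider", wider)]:
    print(f"{name:>12}: {value:6.1f} TFLOPS ({value / baseline:.2f}x)")
```

Options 3, 4 and 5 do not change this peak figure at all. They change how much of it is reachable in real workloads, which is why they are harder to reason about from die configurations alone.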
 
1. Is unlikely as the process should be the same.
2. Putting more SMs into TPC isn't really "going wider" and won't be much different from just adding more TPCs.
3. Possible.
4. That's a given. It should also hint at other changes: more b/w for the same shading performance makes little sense, so maybe they've cut their L2 and put logic there instead?
5. It's highly likely that RT h/w will be improved but you don't really improve overall performance by improving only RT h/w (i.e. just tracing of rays).

One possible scenario is them going back to 32-wide SIMDs and removing the 2-clock warp execution. Would make SM level scheduling harder (more complex in h/w) but could allow for another doubling of peak math throughput with reasonable complexity increase.
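The 32-wide SIMD scenario can be sketched with simple issue arithmetic. This takes the post's description at face value (a 32-thread warp executed on a 16-wide SIMD over 2 clocks). The count of SIMD units per SM is a hypothetical parameter, and the model ignores scheduling, register bandwidth and everything else that makes the change hard in hardware.

```python
import math

WARP_SIZE = 32  # threads per warp on Nvidia GPUs

def cycles_per_warp_instruction(simd_width, warp_size=WARP_SIZE):
    """Clocks needed to push one warp-wide instruction through one SIMD unit."""
    return math.ceil(warp_size / simd_width)

def peak_lanes_per_clock(simd_width, simd_units):
    """Peak math lanes active per clock for a given number of SIMD units."""
    return simd_width * simd_units

SIMD_UNITS = 4  # hypothetical: same number of SIMD units in both cases

for width in (16, 32):
    print(width,
          cycles_per_warp_instruction(width),
          peak_lanes_per_clock(width, SIMD_UNITS))
```

With the unit count held constant, going from 16-wide to 32-wide halves the clocks per warp instruction and doubles peak lanes per clock. That is the "another doubling of peak math throughput" in the post, and it means the scheduler has to find a new warp instruction every clock instead of every other clock.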
 
More SMs per TPC would be a significant increase in width of each chip which is very unlikely on the same node.

My guess is higher clocks, more transistors spent on RT and a boost from GDDR7. They can always position chips lower down the stack to compensate. E.g. GB206 for the 5060 vs AD107 for the 4060. GB207 is probably a 3050 replacement.
 

Guessing time!

5090ti - 448bit bus, 28gb of ram, 525w+, $2k, 40% faster than a 4090
5090 - 384bit bus, 24gb of ram, 450w, $1500, 20% faster than a 4090

5080ti - 256bit bus, 16gb of ram, 350w, $1k, 20% faster than a 4080
5080 - 224bit bus, 14gb of ram, 300w, $750, basically a 4080 super

5070ti - 192bit bus, 12gb of ram, $600, 5% faster than 4070ti Super
5070 - 192bit bus, 12gb of ram, $500, 4070 super

5060ti - 128bit bus, 12gb of ram, $400, 4070
5060 - 96bit bus, 9gb of ram, $329, 4060ti
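A quick way to sanity-check bus width and memory size pairings like the ones above: each GDDR device has a 32-bit interface, so the bus width fixes the device count, and capacity is device count times per-device density. The sketch below assumes 2 GB (16 Gb) and 3 GB (24 Gb) devices and no clamshell mode. The helper name is made up for illustration.

```python
GDDR_DEVICE_WIDTH_BITS = 32  # one GDDR6/GDDR6X/GDDR7 device per 32 bits of bus

def vram_capacity_gb(bus_width_bits, gb_per_device):
    """Capacity for a non-clamshell layout with one device per 32-bit slice."""
    if bus_width_bits % GDDR_DEVICE_WIDTH_BITS != 0:
        raise ValueError(f"{bus_width_bits}-bit is not a multiple of 32")
    devices = bus_width_bits // GDDR_DEVICE_WIDTH_BITS
    return devices * gb_per_device

# (bus width in bits, assumed GB per device) for a few of the guesses
guesses = [(448, 2), (384, 2), (256, 2), (224, 2), (192, 2), (128, 3), (96, 3)]

for bus, density in guesses:
    print(f"{bus:>3}-bit with {density} GB devices -> "
          f"{vram_capacity_gb(bus, density)} GB")
```

Under these assumptions, 14 GB corresponds to a 224-bit bus (7 devices), and the 12 GB on 128-bit and 9 GB on 96-bit guesses only work with 3 GB devices. A width that is not a multiple of 32 is rejected by the helper.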

They'll shrink the L2 cache area with the denser TSMC SRAM library that debuted in Zen 4c, use the same library to enlarge L1, and lean on GDDR7's bandwidth increase to support higher clock speeds. Basically, chips of the same size with a modest performance uplift.

Then they'll update their matrix multiplication units, a very modest RT update, present a bunch of bullshit RTX AI "benchmarks" to show "how much better they are"/claim they've "doubled in performance gen over gen", launch DLSS 4 with AI framerate tripling and hole filling/claim 30fps upscaling to 90 is good enough now, etc.
 
Guessing time!

5090ti - 448bit bus, 28gb of ram, 525w+, $2k, 40% faster than a 4090
5090 - 384bit bus, 24gb of ram, 450w, $1500, 20% faster than a 4090

5080ti - 256bit bus, 16gb of ram, 350w, $1k, 20% faster than a 4080
5080 - 224bit bus, 14gb of ram, 300w, $750, basically a 4080 super

5070ti - 192bit bus, 12gb of ram, $750, 5% faster than 4070ti Super
5070 - 192bit bus, 12gb of ram, $600, 4070 super

5060ti - 128bit bus, 8gb of ram, $400, 4070
5060 - 128bit bus, 8gb of ram, $329, 4060ti
Unless Nvidia have pulled off some incredible architectural performance improvements (via clock speed increases or sheer IPC gains), this is looking like a fairly disappointing generation. It would be nice if Nvidia threw us a bone in terms of pricing/value to compensate, but AMD seems half checked out of the GPU market by now and consumers have decided they are fine with being exploited in order to have the new shiny thing so.....
 
I don’t think there’s any scenario where the 5080 isn’t significantly faster than the 4080. So there won’t be any disappointment in raw performance if you care about model numbers. It’ll just come down to pricing.

What will that 5080 look like? No idea but the easiest option is a cut down GB202. The other option is magic.
 
They absolutely would have been low end if much larger GPUs had existed on the same architecture, as they do now.

150mm² is low end today, and has been for quite a while. I feel like trying to find some wormy argument to slither out of acknowledging such a basic concept is bizarre. We're not talking about some magical super advanced process beyond anything else.
You're arguing semantics in response to a moderator's post saying that such an argument really isn't adding to the conversation?

Revisit what exactly the debate is. What is it about die size that affects whatever it is people are discussing? Notions of 'high' and 'low' end don't need to enter into it, and so don't need to be defined to everyone's satisfaction for the debate to move forward. ;)
 
I've been testing out Lossless Scaling's 3x frame generation and am quite impressed with it. Frame rates in the low 40s are good enough to play, and I'm usually picky about input lag, though with integrated FG like DLSS/FSR, the input latency would be higher.
Anyway, the gist is that Nvidia can promote Blackwell as doing 3x or even 4x FG, which will add to the performance difference and be displayed prominently in the slides. Given how ubiquitous upscaling has become in the past few years, FG will be adopted the same way, and given how good the experience is with it on, it will be seen more and more as 'real' performance.
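The marketing arithmetic here is simple enough to sketch. The snippet assumes the quoted frame rates are the rendered (pre-FG) rates and that responsiveness tracks the rendered frame time rather than the displayed rate, which is a simplification that ignores FG's own added latency. The function names are made up for illustration.

```python
def displayed_fps(rendered_fps, fg_factor):
    """Frames shown per second when each rendered frame yields fg_factor frames."""
    return rendered_fps * fg_factor

def rendered_frame_time_ms(rendered_fps):
    """Time between real (rendered) frames, a rough proxy for input cadence."""
    return 1000.0 / rendered_fps

for base in (30, 40):
    for factor in (2, 3, 4):
        print(f"{base} fps rendered, {factor}x FG -> "
              f"{displayed_fps(base, factor)} fps displayed, "
              f"{rendered_frame_time_ms(base):.1f} ms between real frames")
```

A slide can show the displayed number tripling or quadrupling while the time between real frames stays exactly where it was, which is why the "is it real performance" argument keeps coming up.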
 