NVIDIA: Beyond G80...

If this G80 to G90 transition is to be anything like the NV40/NV45 to G70 one, then the performance improvement will have to be closer to 90~100%.
The 7800 GTX (256MB) was in fact faster than two SLI'd 6800 Ultras.

Yes, but remember that GF6800 Ultras in SLI don't give a 90-100% performance bump over a single GF6800 Ultra....
When I wrote about a 50% advantage over G80, I was thinking about the average performance improvement.... In some situations (eye-candy modes at high resolutions) it could be 60-70% faster, but in others "only" 30-40%....

Imo it's the same situation as with the GF6800 and GF7800.... There was about a year between NV40 and G70 with no refresh in between (no GF6900). Now NVIDIA will probably skip its G80 refresh (just like it skipped a GF6900) and release its new GPU, G90, around October/November (like the GF7800 two years ago)....

My G90 specs??
65nm
700-800MHz (TMU/ROP clock)
192 improved SPs clocked at about 1.8-2GHz
32 TMUs (slightly improved)
24 ROPs
512-bit MC, 512/1024MB
+ other minor improvements

I think a GPU with these specs could easily be at least 50% faster than the GF8800GTX....
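A quick back-of-envelope check of those specs against the 8800 GTX; note the 1.0GHz GDDR4 memory clock is my own assumption, since the list above only gives the bus width:

```python
# Back-of-envelope comparison of the speculated G90 specs against the 8800 GTX.
# The 1.0 GHz GDDR4 memory clock is an assumption; the post only says "512-bit MC".

def gflops(sps, shader_ghz, flops_per_sp=3):  # MADD+MUL counting, as for G80
    return sps * shader_ghz * flops_per_sp

gtx_flops = gflops(128, 1.35)   # ~518 GFLOPS
g90_flops = gflops(192, 1.9)    # ~1094 GFLOPS at the 1.8-2 GHz midpoint

gtx_texels = 32 * 0.575         # 18.4 GTexels/s bilinear (32 TAs @ 575 MHz)
g90_texels = 32 * 0.750         # 24.0 GTexels/s at a 750 MHz midpoint

gtx_bw = 384 / 8 * 0.9 * 2      # 86.4 GB/s (384-bit, 900 MHz GDDR3)
g90_bw = 512 / 8 * 1.0 * 2      # 128.0 GB/s (512-bit, assumed 1.0 GHz GDDR4)

print(g90_flops / gtx_flops)    # ~2.11x ALU throughput
print(g90_texels / gtx_texels)  # ~1.30x texturing
print(g90_bw / gtx_bw)          # ~1.48x bandwidth
```

On paper the ALU throughput clears the 50% bar easily; texturing and bandwidth are the tighter margins.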
 
More interesting than the replacement of G80, though, is the chip that would fill the hole between G84 and the G80 replacement. This one could be the big winner if it comes to market in early 3Q07.
 
More interesting than the replacement of G80, though, is the chip that would fill the hole between G84 and the G80 replacement. This one could be the big winner if it comes to market in early 3Q07.
Or perhaps that 192SPs chip is precisely that... (with the slight 'problem' that it's competitive with G80! Good thing there are other G9x, heh?)

I think I listed some of this a few pages back, and keep in mind this is 80%+ speculation, but:
- 6 clusters, 4x8-wide ALUs per cluster, 3xInterpolators.
- 4xTexture Address units per cluster.
- 8xTexture Filtering units per cluster.
- 4xQuad ROPs, ala G84 or even beefier?
- 256-bit memory bus, 1.2-1.5GHz GDDR4.
- 2-2.5GHz shader core, 750-800MHz Core.

What's interesting there is that they can match the 8800 GTX's texture fillrate and memory bandwidth at 767MHz+ Core and with 1.35GHz+ GDDR4, while also leaving room for a GX2 and improving /mm² efficiency drastically compared to G80. It also allows for a SKU with a 192-bit memory bus...
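The matching clocks fall straight out of the unit counts; a quick sketch:

```python
# Where the "767 MHz+ core / 1.35 GHz+ GDDR4" figures come from:
gtx_texel_rate = 32 * 575        # 18400 MTexels/s (8800 GTX: 32 TAs @ 575 MHz)
tas = 6 * 4                      # 24 texture address units in the speculated chip
print(gtx_texel_rate / tas)      # -> 766.7 MHz core needed to match

gtx_bw = 384 / 8 * 0.9 * 2       # 86.4 GB/s (384-bit, 900 MHz GDDR3)
print(gtx_bw / (256 / 8 * 2))    # -> 1.35 GHz GDDR4 needed on a 256-bit bus
```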

The only 'problem' with that design is that due to the 1:1 coupling between the memory bus and the ROPs, they'd need a core clock of 862.5MHz to match the G80 in terms of ROP power, which seems rather high to say the least. Assuming I am correct and that this is the design they're working on, it'll definitely be quite interesting to see what they do with the ROPs.
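For reference, a quick sketch showing where 862.5MHz comes from:

```python
# G80 has 24 ROPs at 575 MHz, while a 256-bit bus with 1:1 coupling between
# ROP partitions and memory channels implies 16 ROPs (4 quad ROPs).
g80_rop_rate = 24 * 575          # 13800 Mpixels/s of ROP throughput
print(g80_rop_rate / 16)         # -> 862.5 MHz core needed to match
```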

If this is correct, it's not very hard to see what their lower-end derivatives would be either. You'd have one 128-bit part and a 64-bit part. The ROP count would follow as expected, while the TMUs would logically be 2/3rd and 1/3rd of the chip described above, with most likely a lower ALU ratio (why would you want to use anything but the highest-end part for GPGPU, really?) - that would also leave room for an IGP which would be ~1/6th.
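To make the scaling concrete, a rough sketch of that line-up; the cluster counts for the smaller parts are just my reading of the 2/3rd, 1/3rd and ~1/6th figures, nothing confirmed:

```python
# Derivative line-up implied above, at 4 texture address units per cluster.
lineup = {
    "high-end":  {"clusters": 6, "bus_bits": 256,  "tas": 24},
    "mid-range": {"clusters": 4, "bus_bits": 128,  "tas": 16},  # 2/3 of the TMUs
    "low-end":   {"clusters": 2, "bus_bits": 64,   "tas": 8},   # 1/3 of the TMUs
    "IGP":       {"clusters": 1, "bus_bits": None, "tas": 4},   # ~1/6th, shared memory
}
for name, cfg in lineup.items():
    print(name, cfg)
```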

Or I could be horribly wrong, of course. Either way, it'll be quite interesting to see how RV670 and R650 will turn out too, since they will be the direct competitors - but this isn't the right thread for that, obviously.
 
Arun,

I am not sure if I understood you correctly, but such a chip would have pure calculation power equal to that of 2 or 3 8800 chips at GTX speed.
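(A quick check of my math, counting one scalar op per SP per clock:)

```python
# 6 clusters x 4 multiprocessors x 8 lanes = 192 SPs at 2-2.5 GHz.
g80_gtx = 128 * 1.35             # 172.8 G scalar ops/s at GTX clocks
print(192 * 2.0 / g80_gtx)       # -> ~2.2x at 2.0 GHz
print(192 * 2.5 / g80_gtx)       # -> ~2.8x at 2.5 GHz
```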

That would be a monster (calculation-wise) even for the high end of the high-end segment, not to mention the 'humble' upper mid-range.
 
Yeah, that's our current thinking (that they're looking to produce a FLOP monster and go after a market segment with it), but it's just speculation. Personally, I'm not entirely convinced they'll use a changed ALU structure for the lower end derivatives (if they go the route we think they are), but that's just me.
 
Personally, I'm not entirely convinced they'll use a changed ALU structure for the lower end derivatives (if they go the route we think they are), but that's just me.
Neither am I, but as far as I can see, that could be done very easily by modifying the number of multiprocessors per cluster, or the width of each multiprocessor (although that wouldn't scale the scheduler logic as well, obviously...)

One way to consider this possible chip is R520->R580; they added 32 Vec4 ADD+MADD ALUs for only 64mm², while also expanding the register file (but not changing the scheduler, thus tripling the batch size). And the die size of other R5xx chips confirmed that this was indeed correct, and that there weren't other major optimizations to the chip that managed to hide the true cost of the new ALUs.

Consider that this chip would 'only' be adding 64 scalar ALUs (16 in Vec4 terms) and on a 65nm process instead, and that they are most likely much more custom than R5xx's... Of course, it is not fully comparable; I am assuming NVIDIA also wants to keep branching coherence constant by adding more multiprocessors rather than by increasing the ALU width (-> more scheduling logic, unlike for R5xx). The ALUs and schedulers might also be more complex (integer; unified; etc.)

So obviously it'd be crazy to expect that adding 64 SPs would only cost NVIDIA 16mm² like you would expect in the R5xx's case. However, it's probably not that hard to imagine that a chip with 6 clusters and a 2xALU ratio wouldn't be larger than one with 8 clusters and no change to the ALU ratio. And if you had to choose between the two, I'd say the 2xALU chip is much more attractive (GPGPU; higher percentage of the die being custom or semi-custom; more forward-looking; etc.) - imo, at least.
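For reference, one way to arrive at that 16mm² figure from the R580 numbers above; a sketch assuming an ideal linear shrink, which real designs never quite achieve:

```python
# Reconstructing the "16 mm2 in the R5xx case" figure from the R580 analogy:
mm2_per_vec4 = 64 / 32                  # R580: 32 Vec4 ALUs for ~64 mm2 on 90 nm
vec4_equiv = 64 / 4                     # 64 scalar SPs ~= 16 Vec4 ALUs
area_90nm = vec4_equiv * mm2_per_vec4   # 32 mm2 on 90 nm
area_65nm = area_90nm * (65 / 90) ** 2  # ideal linear shrink to 65 nm
print(area_65nm)                        # -> ~16.7 mm2
```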
 
I think I listed some of this a few pages back, and keep in mind this is 80%+ speculation, but:
- 6 clusters, 4x8-wide ALUs per cluster, 3xInterpolators.
- 4xTexture Address units per cluster.
- 8xTexture Filtering units per cluster.
- 4xQuad ROPs, ala G84 or even beefier?
- 256-bit memory bus, 1.2-1.5GHz GDDR4.
- 2-2.5GHz shader core, 750-800MHz Core.

Dang, that other 20% must be some high-grade stuff! You guys really think that Nvidia is gonna triple G80's shader power with a "performance" part in a few months? Honestly I think that's the stuff of Jen-Hsun's dreams :) And Nvidia is developing a habit of aiming much lower than where we think they're going - G84 and the Ultra are the most recent examples of that.
 
I'm not quite sure how one product cycle indicates a habit, trini. Especially since in the last one they went way above and beyond expectations.
 
True they did knock one out of the park with G80 and very well might follow up with another big one. But I still think that's asking a lot of their first 65nm part - would love to be surprised though. If they do plan to mess around with the number of SIMD blocks (do we have a name for these yet?) per cluster I think 3x is a more reasonable estimate and maybe 4x for the tippity-top parts that also serve as their GPGPU line. Though I don't see why these parts can't be competitive with the current 2x configuration.
 
You guys really think that Nvidia is gonna triple G80's shader power with a "performance" part in a few months?
You know, this might look like a strange answer at first, but NVIDIA (and David Kirk, specifically) used to say that GPUs doubled in performance every 6 months, because they benefit fully from density, clock *and* architectural improvements.

This is obviously pure marketing, but there is some truth to it: there are a multitude of units in a modern processor, and thus you can consider the ratios between these different kinds of units in a GPU. If you considered one of these units to be a complete and unquestionable bottleneck in every single case, and by a large degree, then doubling that specific unit would double your performance.

This is the kind of reasoning that you often saw (and sometimes still see) regarding performance improvements. Of course, the catch is that if you have units A, B and C and that one generation you double A, next-gen you double B and finally you double C in your third generation... Then even though you may claim you doubled performance three times, your peak performance under any scenario is not more than doubled compared to the initial architecture.
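A toy model makes the point concrete (purely illustrative; real workloads shift their bottlenecks around):

```python
# Per-frame work on units A, B and C is fixed; frame time is set by
# whichever unit is the bottleneck.
work = {"A": 1.0, "B": 1.0, "C": 1.0}
rates = {"A": 1.0, "B": 1.0, "C": 1.0}

def frame_time(r):
    return max(work[u] / r[u] for u in work)

base = frame_time(rates)
for unit in ("A", "B", "C"):     # "double performance" once per generation
    rates[unit] *= 2
    print(unit, base / frame_time(rates))
# -> A 1.0, B 1.0, C 2.0: three claimed doublings, only 2x overall
```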

The reason I'm saying this is that all you're doing really here is changing the ratios, with the goal of achieving roughly the same die size as you would have without changing anything and just scaling naively from 90nm to 65nm.
Honestly I think that's the stuff of Jen-Hsun's dreams
Hehe. Well, halving your die size (and power?) while more than doubling your arithmetic power and adding FP64 support would make Jen-Hsun quite happy indeed, and I can imagine some GPGPU guys just drooling at the sheer awesomeness of that. Now if only they decided to price it competitively (HPC is very much about perf/$, as far as I can tell) thanks to their economies of scale, I can imagine things heating up quite nicely...

I'm actually assuming this chip also uses NVIO, which implies that a GPGPU version could be lacking display functionality completely. That means they could price it lower than current Quadros and not compete with them directly. I certainly think the HPC market would be quite interested (yay! another understatement!) in a "4TFlops+ in a box" solution for about $10K. That'd still be 80%+ margins for NVIDIA, and ridiculously high ASPs (considering they'd sell the whole box, rather than just the chip...)

P.S.: Not sure why you say 'their first 65nm' part - how many of those (in the high-end at least) are you expecting?! :) Depending on how 45nm goes at TSMC, which I suspect will be "rather well indeed", I wouldn't be surprised if 55nm was only used on budget parts. We'll see, interesting times ahead on many fronts either way.
 
I'm not quite sure how one product cycle indicates a habit, trini. Especially since in the last one they went way above and beyond expectations.

G71? Remember how the consensus was leaning heavily towards another 2 quads right up to almost the very end?

But then the community typically tends to engage in a little wish-fulfillment about these things.

Personally, I tend to think they aren't thrilled with the die size on G80 and will turn a decent chunk of the process move into a smaller die. This would be consistent with early messaging from the PSU guys that this gen would be :oops: :oops: :oops: on power requirements, but sanity would begin to reign again on the gen following. On the other hand, to be contrarian with myself, that "surely they can't go much farther on heat/power?" thing has been rumbling around for about three years now -- see Orton's interview at the R420 launch.

I'm still waiting to see if NVIO was a one-off for yield purposes or really will stick around for a GX2 (or, as Arun just pointed out, a GPGPU version) at some point.
 
The reason I'm saying this is that all you're doing really here is changing the ratios, with the goal of achieving roughly the same die size as you would have without changing anything and just scaling naively from 90nm to 65nm.

Sure, I think the architecture is fiddlesome enough that you can do some fancy things with unit counts and ratios. But I think the performance bar is just set too high for the shrink. I was actually thinking 30-60% faster than the Ultra + FP64 support + a smaller, cooler die would do the trick for a late-2007 part.

Hehe. Well, halving your die size (and power?) while more than doubling your arithmetic power and adding FP64 support would make Jen-Hsun quite happy indeed, and I can imagine some GPGPU guys just drooling at the sheer awesomeness of that. Now if only they decided to price it competitively (HPC is very much about perf/$, as far as I can tell) thanks to their economies of scale, I can imagine things heating up quite nicely...

Yeah and I'm interested to see how AMD plans to keep up. R600 does seem to be heavily reliant on clockspeed given its low texture/RBE unit counts but somehow I think Nvidia will have the die size advantage at 65nm.

P.S.: Not sure why you say 'their first 65nm' part - how many of those (in the high-end at least) are you expecting?! :) Depending on how 45nm goes at TSMC, which I suspect will be "rather well indeed", I wouldn't be surprised if 55nm was only used on budget parts. We'll see, interesting times ahead on many fronts either way.

Well I was expecting quite a few actually :) I was basically saying that such an impressive part was ambitious for a first shot at 65nm. Or will it not be their first shot?
 
how much space do they need for full 64-bit support?

I suppose that depends on how fast you want that support to be. As a graphics chip first, it would, I think, make more sense to run fp64 with a speed penalty. That'd still give the gpgpu guys a lot to play with....

The other place I'd spend time would be to bump the filtering unit speed and/or attempt to leverage the faster, general processor blocks (and raise the counts thereof). If you can double the filtering unit speed, you can halve the space they use, and if you can leverage the general processor blocks, you can move some percentage of the diespace of your filtering units into providing more programmable processors. R600's separation of filtering from sampling kind of suggests that route (to me anyway).

As to doubling speed every six months -- I remember that quote. I also remember when G80 shipped, and I'm looking at the Ultra, and I'm thinking 'someone is late'.

-Dave [puts away the dead horse]
 
No Arun, I think what you describe could rather end up as a new high-end part with a double memory bus and double the ROPs.

A G92 should be around 8800GTS level. Rumors of a price cut on the GTS 640MB and GTX right on time for the HD2900XT launch make me believe that a replacement will be around just in time for the back-to-school season, aka late 3Q07.
 
No Arun, I think what you describe could rather end up as a new high-end part with a double memory bus and double the ROPs.
Fewer TMUs and more ROPs? Wouldn't that be horribly unbalanced and even less tuned for future workloads? :)

Also I'm not sure how you could have a line-up with 4 chips (which seems to be what you're suggesting; I'm suggesting 3 chips...) and have your highest-end chip measure only 200-240mm². Isn't the gap in performance too small between the different parts then? I'm also not sure how you can have a 512-bit bus on such a small chip!
 
I think the current NV line-up suggests 4 main chips for the future.

At the moment, a G82 is simply missing. The high-end chip has become too big to fill the gap between the G84-style chips and the high end itself.

As SLI is still not for everybody and doesn't work every time, a GX2 option as the only high-end offering is imho not likely for the near future. That, however, does not mean we will not see a GX2.
 
I don't think nV will go for the 512-bit bus. Maybe for just one chip as a stop-gap, but I expect them to move towards some serial bus like FlexIO or such rather soon, maybe in 2008 already.

A 512-bit parallel bus is pure design madness (although ATI did it, and nV will likely do so with the G90).
 
Can someone explain what the point of including really slow fp64 is (by slow I mean like the DP support on the current Cell)?

From where I'm standing, fp64 is a feature aimed at gpgpu. If it really is fairly castrated, then I can see high-end general-purpose CPUs soon having achievable DP FLOPs at least approaching those of the GPU, and an HPC Cell completely wiping the floor with it. Those platforms can also be scaled to much larger memory capacities/socket counts than a GPU, so ...

I'm actually expecting fairly decent fp64 performance (say 2x slower than fp32, no filtering/blending support) from the get go.
 
If it really is fairly castrated, then I can see high-end general-purpose CPUs soon having achievable DP FLOPs at least approaching those of the GPU, and an HPC Cell completely wiping the floor with it.
I think you need to seriously look at just how many DP flops you get with a high-end CPU...
 
Well I was expecting in the very low tens for Barcelona (4 DP flops per cycle per core, 4 cores, 3 GHz => 48 GFlops peak). I'm too lazy to dig up the exact Cell numbers, but I seem to recall 20-30 DP Gflops, versus > 100 SP. This would be compared to a GPU with a slow DP implementation. Not sure what exactly "slow" is there compared to SP, and granted, the memory systems of the general purpose MPUs suck compared to the GPU... I dunno, what's your guess for DP performance of the first GPUs to support it?
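For comparison's sake, a sketch of the peaks being discussed; the GPU line is pure speculation, taking the 192-SP/2GHz configuration from this thread and assuming fp64 at half the fp32 rate:

```python
# Peak DP GFLOPS as discussed above.
barcelona = 4 * 4 * 3.0          # 4 DP flops/cycle x 4 cores x 3 GHz = 48 GFLOPS
cell_dp = 25.0                   # roughly the 20-30 GFLOPS recalled above
gpu_dp = 192 * 2.0 * 2 / 2       # MADD = 2 flops, half-rate fp64 -> 384 GFLOPS (speculative)
print(barcelona, cell_dp, gpu_dp)
```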
 