The G92 Architecture Rumours & Speculation Thread

Yes, but the data point I'm considering here is 256-bit... Does it really make sense to have a ~350mm2 chip with only a 256-bit bus when they could fit a 320-bit one, or even wider, at that size? Presumably the die size has to be between 200 and 300mm2 for it to make sense IMO...

Why are you thinking it will be 256 bit? I don't recall them ever going to a lower bus width between generations of high end chips. Maybe if G92 is not the top of the line...but in any case it would be very surprising if there wasn't a new high end chip this fall.

And even if it does come with a 256 bit bus, R580 and G70 were sizable chips with such a feature.
 
Wrong, I bet 100% that G92 will be a cut-down version of G80 with slight improvements. And it could be faster than the 8800GTS.


G92, in terms of the package, will be similar to Q6600. However, another problem, such as the TMU limitation, will have to be solved before the launch of G92.


ATI must be cautious that Nvidia will no longer play the super-high-end role, given the refresh product coming shortly after the G80.

What I'm also starting to assume is that Nvidia's G92 is a gateway GPU, letting the 65nm process mature in case they ever need to replace the 8800GTX or Ultra, or in case AMD/ATI (aka DAMMIT) releases a challenger in R650 (possibly 65nm) or a possible threat in R700 (rumoured to be 55nm).
 
Two-die packages don't make much sense for graphics cards, because there you could build solutions with two simple PCBs and thus better-spread heat sources.
Well, you'd probably do that if you wanted free cookies, but meh! :) (see sig).

But the question is how those numbers are reached. Sure, there could be an 8-cluster GPU with 0.8/2.4GHz clocks, but I doubt it would be called G92 and have only a 256-bit MC (bandwidth appropriate to those 50GT/s would need 2GHz memory).
Sigh, I hate memory clocks. I hate them with a passion. I mean, it's not even as if the non-effective rates made sense; the memory chips are clocked at way less than that nowadays...

Anyway, if we presume 50GT/s per chip with a 256-bit bus, then this definitely implies 1.2GHz+ GDDR4, where the G80 GTX has 0.9GHz GDDR3. My guess is that we will see a single-chip SKU with 512MiB 1.4GHz GDDR4 as the highest-end non-multi-GPU model.

Another thing to take into consideration is that it should be easy for NVIDIA to shrink G92 to 55nm and use GDDR5 instead of GDDR4 in mid-2008. That could also explain why they might be willing to be slightly bandwidth limited with this part. And TMUs aren't as much of a limitation as some seem to think if you're smart about how you use them anyway - plus, high levels of AF require much more filtering power than bandwidth.

The bigger question probably is what they've done for the ROPs to be balanced at 800MHz with 1400MHz memory, if that is indeed the case. This would be a chip with 3% more bandwidth than the G80 GTX, but 7% less ROP power. Arguably not a big deal since G80's ROPs are overkill, but my guess is they'll have left depth alone and improved stencil & blending. The current blending part of the chip is a joke anyway... I'd expect a similar transition to G70->G71 to happen there, with the introduction of full-speed INT8 blending. AMD already has that too, so IMO they need it to be competitive with 4-ROP SKUs...
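For what it's worth, those two percentages are easy to sanity-check. A quick Python sketch, assuming the speculated part really is 16 ROPs at 800MHz (4 per 64-bit partition, as on G80) with 1.4GHz GDDR4 on a 256-bit bus - all of which is speculation, not confirmed specs:

```python
# Back-of-the-envelope check of "3% more bandwidth, 7% less ROP power".
# The 16 ROPs / 800MHz / 1.4GHz GDDR4 / 256-bit figures are the speculated ones
# from this thread, not confirmed specs; the G80 GTX numbers are the known ones.

def bandwidth_gb_s(bus_bits, mem_clock_ghz):
    return bus_bits / 8 * mem_clock_ghz * 2  # DDR-type memory: 2 transfers/clock

gtx_bw = bandwidth_gb_s(384, 0.9)   # ~86.4 GB/s (G80 GTX)
g92_bw = bandwidth_gb_s(256, 1.4)   # ~89.6 GB/s (speculated)

gtx_rop = 24 * 0.575                # Gpixels/s: ROP count * core clock (GHz)
g92_rop = 16 * 0.800

print("bandwidth: %+.1f%%" % ((g92_bw / gtx_bw - 1) * 100))    # about +3.7%
print("ROP rate:  %+.1f%%" % ((g92_rop / gtx_rop - 1) * 100))  # about -7.2%
```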
 
That's a SKU though, not a new chip... I don't think it really counts for what we are thinking of here - although it'd be interesting to know if it was originally supposed to be based on the G81 or something along these lines.

Why does it have to be a chip? Perhaps G81 was scrapped or never existed in the first place. The Ultra should give them the same effect but without affecting their margins.

I remember a rumour a while back that the G100 was the one to be pitted against the R700. Which leads me to believe that the G9x series is some type of bastard child series.

G92 as an 8800GTS replacement, and G98 as an 8600 replacement. Both with better margins and PCIe 2.0 support. That's what I'm thinking.
 
Why does it have to be a chip?
Oh, I agree it doesn't have to be from a consumer POV. But we were mostly thinking in terms of engineering roadmap AFAICT, and clearly a new SKU is nearly 'free' there. In this case, the Ultra happened at the same time as a respin, but that was probably more of a coincidence (as can be witnessed by the existence of A12 Ultras).

I remember a rumour a while back that the G100 was the one to be pitted against the R700. Which leads me to believe that the G9x series is some type of bastard child series.
G100 was never, ever the one to be pitted against the R700 AFAIK. As the codename clearly implies, it was originally meant to be two generations after G8x, while R7xx is one generation after R6xx.

and G98 as an 8600 replacement
Some websites are claiming that G98 is a 64-bit chip, which implies that it's most likely more of a G86 replacement. But the G86 is also a 128-bit DDR2 product, so my guess is that a GDDR3 G98 and/or a DDR2 G84 with redundancy will fill that gap for a while.

And then probably around Q208, you'll have your G84 replacement. My guess is that will be a 55nm chip, very similar to G84 but with 16 addressing units/32 filtering units (instead of 16/16) and more ALU power etc... That should probably be doable at around 110-130mm2, which is much nicer than G84's ~170mm2!

EDIT: Actually thinking about that a bit more, that might be expecting too much out of the G84 replacement. If they're sticking to GDDR3 there, then they'd just become bandwidth starved. Thus, it would make more sense not to improve the TMUs, and just focus on improving ALU performance and small stuff like blending, triangle setup, etc... And then get that out for 100-115mm2.
 
I'm wondering if texturing in G84/86 is ALU-bottlenecked (similar to R520)?
I'm not sure exactly what you mean there - are you thinking of hiding latency? Obviously R580 tripling the register file helped there, but I'm not sure it's really much of a problem on G8x. But I could be wrong of course...

If what you meant is that the TMUs are idling because the GPU is ALU-bottlenecked, then that clearly must sometimes be the case, because otherwise more ALUs would result in a net gain of 0%. As for how often that is the case, I wouldn't be surprised if it was quite frequent without AF. From that POV, it would certainly be interesting to compare the GeForce 8400 and 8300's performance...
 
G100 was never, ever the one to be pitted against the R700 AFAIK. As the codename clearly implies, it was originally meant to be two generations after G8x, while R7xx is one generation after R6xx.

Well, I have no idea what Nvidia originally planned or not. From what I'm seeing so far, the G9x series sounds like a line-up refresh meant to address new emerging technologies while also improving their margins.

The 256-bit rumour really does interest me for the G92. It kinda goes hand in hand with the mutterings of a G92x2.
 
And a little less power consumption - then it'd be mine (or an RV670, if AMD is competitive this time and does not choose to drop the midrange as well, to be able to address a wider audience at a great price point - *SCNR*).

Just how much more shader power does G92 need over G80 to reach, say... 500 GFLOPS?
Not much - about 20% compared to the Ultra. But what for? The famous 1-TFLOP next-gen product?
 
I've been thinking about a better question lately: What would G86 need to reach 900GFlops on a MADD-only shader? I'm thinking both in terms of architecture and number of units - I'm not implying that an 80-120mm2 chip could reach 1 TFlop, thank you very much! :p
 
I've been thinking about a better question lately: What would G86 need to reach 900GFlops on a MADD-only shader? I'm thinking both in terms of architecture and number of units - I'm not implying that an 80mm2 chip could reach 1 TFlop, thank you very much! :p

Why do you think this is a better question? What do you mean "MADD-only shader" - physically or program code? Which chip's supposed to have 80mm² die size? Not G86 exactly, ain't it?

(I know - many questions...)
 
Why do you think this is a better question?
Because G86 is presumably the G8x with the most architectural changes, including how the MUL is exposed. As such, you would expect those changes not to have been done in a vacuum - they might also be reused for G9x.

What do you mean "MADD-only shader" - physically or program code?
I'm not sure what you mean by physically. I'm thinking of a shader that is literally just a bunch of non-optimizable MADDs. So that's what would be fed to the compiler - of course, this could be converted to separate MULs and ADDs by the compiler before it is sent to the hardware.

This is an extreme case, of course, but it is interesting for the following reason: under such a circumstance, because the ratio of MULs to ADDs is exactly one, the extra MUL is exactly useless. It doesn't matter what you do - you couldn't make your shader run one cycle faster thanks to this unless you somehow optimized it thanks to the order of operands (which was said to be impossible in my premise).

Why is this important? From an engineering POV, perhaps not so much. But from a marketing perspective, it could be very important - and what I'm implicitly proposing is that it would be very easy (and technically fair) to market G80 as having "192 stream processors" if the MUL unit could also run an ADD, but not a MADD. It would also be a ridiculously easy and efficient way to increase average FP throughput on the basis of the G86's architecture... But I am only speculating here, obviously.
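To make the "192" arithmetic explicit, here's a toy Python model of what I mean - purely a sketch of the scheduling argument above, not of how the hardware actually dispatches anything:

```python
# Toy throughput model for a MADD-only shader on one SP with a main MADD unit
# plus one companion unit. Pure illustration of the argument above.

CYCLES = 1000

# Companion is MUL-only: every MUL it produces still needs the main unit's ADD
# to complete a MADD, so the pair never retires more than 1 MADD per cycle.
madds_mul_only = CYCLES

# Companion can do MUL *or* ADD (but not a fused MADD): it retires one
# decomposed MADD every 2 cycles on its own (MUL one cycle, ADD the next),
# while the main unit keeps issuing a full MADD every cycle.
madds_mul_or_add = CYCLES + CYCLES // 2

print(madds_mul_or_add / madds_mul_only)  # 1.5x -> 128 "SPs" marketed as 192
```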

Which chip's supposed to have 80mm² die size? Not G86 exactly, ain't it?
Sorry, I edited my post regarding that - I was thinking of the approximate die size of a direct shrink of G86 to 65nm, but of course that was purely theoretical and beside the point.
 
Because G86 is presumably the G8x with the most architectural changes, including how the MUL is exposed. As such, you would expect those changes not to have been done in a vacuum - they might also be reused for G9x.
Hopefully not, but for the time being I quite like the texturing fill rate of the original G80 - since I happen to like playing older games with all the (SSAA) bells and whistles turned on. :)

I'm not sure what you mean by physically. I'm thinking of a shader that is literally just a bunch of non-optimizable MADDs. So that's what would be fed to the compiler - of course, this could be converted to separate MULs and ADDs by the compiler before it is sent to the hardware.
I just wasn't sure whether you meant shader hardware or shader code. Thanks for clarifying that.

This is an extreme case, of course, but it is interesting for the following reason: under such a circumstance, because the ratio of MULs to ADDs is exactly one, the extra MUL is exactly useless. It doesn't matter what you do - you couldn't make your shader run one cycle faster thanks to this unless you somehow optimized it thanks to the order of operands (which was said to be impossible in my premise).

Why is this important? From an engineering POV, perhaps not so much. But from a marketing perspective, it could be very important - and what I'm implicitly proposing is that it would be very easy (and technically fair) to market G80 as having "192 stream processors" if the MUL unit could also run an ADD, but not a MADD. It would also be a ridiculously easy and efficient way to increase average FP throughput on the basis of the G86's architecture... But I am only speculating here, obviously.
Interesting thoughts, but I hope Nvidia can restrain themselves in that matter. It wouldn't do them much good in the enthusiast community, I'd think.

In my experience, apart from theoretical tests, G86 does not seem to profit much from its increased MUL rate though. So I do hope that you're wrong and that Nvidia are concentrating on delivering gaming horsepower instead of marketing horsepower. :)

Sorry, I edited my post regarding that - I was thinking of the approximate die size of a direct shrink of G86 to 65nm, but of course that was purely theoretical and beside the point.
No harm done - I just couldn't see what you were pointing at.
 
Hopefully not, but for the time being I quite like the texturing fill rate of the original G80 - since I happen to like playing older games with all the (SSAA) bells and whistles turned on. :)
Indeed, although I'm not sure what you're getting at now either! :)

Anyway, I think we can all agree that while G80's texturing capabilities might perhaps be overkill relative to ALU performance in recent games, it certainly isn't overkill relative to bandwidth. This can easily be proven: 64TMUs*575MHz*1Byte => ~37GB/s. This is assuming normal texture cache behaviour and trilinear and/or anisotropic filtering on every texture fetch. These TMUs are only overkill when used on non-DXT/3Dc textures...

In fact, just because I'm such a nice guy, I'll do an even more interesting calculation: 64TMUs*800MHz*1Byte => ~51GB/s. Thus, even an 8800GTS on a 256-bit bus could benefit from that number of TMUs if the ROPs weren't stealing so much of that precious bandwidth! ;) And of course, such a scenario happens with long shaders: if your shader takes 100 cycles for example, then your ROPs will be idle most of the time and not taking any bandwidth...
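Same arithmetic in one place, for anyone who wants to play with the bytes-per-fetch number (the ~1 byte of new data per filtered fetch is the assumption from above, and the 800MHz part is hypothetical):

```python
# Texture bandwidth demand at ~1 byte of new data per filtered fetch
# (compressed textures + normal cache behaviour, as assumed above).

def tmu_demand_gb_s(tmus, core_clock_ghz, bytes_per_fetch=1.0):
    return tmus * core_clock_ghz * bytes_per_fetch

print(tmu_demand_gb_s(64, 0.575))  # G80 GTX clock: ~36.8 GB/s
print(tmu_demand_gb_s(64, 0.800))  # hypothetical 800MHz part: ~51.2 GB/s
```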

In my experience, apart from theoretical tests, G86 does not seem to profit much from its increased MUL rate though. So I do hope that you're wrong and that Nvidia are concentrating on delivering gaming horsepower instead of marketing horsepower. :)
I'm not sure how you can really say that - I doubt G84 and G86 are perfectly identical except for the number of shader clusters. They must at the very least have been more frugal with things like cache sizes. So even disabling one cluster on G84 would not give you a perfect comparison...

Anyway, even if you were right that adding this highly accessible 'MUL or ADD' wouldn't help gaming performance much... It would clearly help FP32 CUDA a fair bit. So it would at least make sense from that POV - and fwiw, I do believe that it would be a very good decision in terms of perf/mm2 for gaming anyway.
 
If it's really "highly accessible" - no doubt about it, yeah. But AFAICT this accessibility shows up only in very arithmetic-heavy environments, aka CUDA-style apps. Games in general, even newer ones, do not utilize the ALUs as heavily as e.g. AMD would like them to. So G86's changes have barely had a chance to shine yet.

So, if the coming generation is not going to stay in the market unchanged, I think it'd be better off investing those transistors in a larger register file, in the ominous thing which keeps the GS from being competitive with AMD, or maybe even in a higher triangle rate.

In CUDA I absolutely agree with you.
 