The G92 Architecture Rumours & Speculation Thread

If it's really "highly accessible" - no doubt about it, yeah. But AFAICT this accessibility shows up only in very arithmetic-heavy environments, i.e. CUDA-style apps. Games in general, even newer ones, don't use the ALUs as heavily as AMD, for example, would like them to. So G86's changes have barely had a chance to shine yet.
I'm not sure I really agree with that. The problem IMO is that people expect entire games to be ALU-limited, and that unless performance scales basically as fast as the number of GFlops, the ALUs are not the bottleneck. That's absolute nonsense!

It's really not hard to see how only half a frame or less might be ALU-bottlenecked. You've got Z/Stencil/Shadow passes, and some pixels requiring high levels of AF. There are also potential ROP and triangle setup bottlenecks even in color passes... In the end, what shocks me is reasoning such as this:
http://www.anandtech.com/video/showdoc.aspx?i=2931&p=5 said:
F.E.A.R. does respond a little better to shader overclocking than Oblivion, but even at 8.3% improvement at 1600 MHz, F.E.A.R. performance doesn't even improve at half the rate shader clock speed is increased. Like Oblivion, F.E.A.R. benefits much more from increasing core clock speed.
Performance increases by 8.3% for a 15% shader clock increase (1350->1550) and they aren't happy?! Pff, whiney bitches! :)
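A quick back-of-the-envelope check of that (a sketch only - the 1350/1550MHz clocks and the 8.3% figure are simply the numbers quoted above):

```python
# Rough shader-overclock scaling check, using the figures quoted in the posts above.
base_clock = 1350.0   # MHz, stock G80 GTX shader clock
oc_clock   = 1550.0   # MHz, overclocked shader domain (per the post above)
fps_gain   = 0.083    # 8.3% frame-rate improvement reported by AnandTech

clock_gain = oc_clock / base_clock - 1.0   # ~0.148, i.e. ~15% faster ALUs
scaling    = fps_gain / clock_gain         # fraction of the clock gain that shows up as FPS

print(f"shader clock gain: {clock_gain:.1%}")   # 14.8%
print(f"frame-rate gain:   {fps_gain:.1%}")     # 8.3%
print(f"scaling ratio:     {scaling:.2f}")      # ~0.56 - more than half the clock gain is realised
```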

So, if the coming generation is not going to stay in the market unchanged, I think it'd be better off investing those transistors in a larger register file, in whatever it is that keeps the GS from being competitive with AMD, or maybe even in a higher triangle rate.
I'm not sure how (much) a larger register file would help G80. But it might come in handy in G92 if they've got both a higher clock rate and higher memory latency (if they are indeed using GDDR4)...

As for triangle setup and GS performance, I completely agree. The former will already be helped a bit by the clock speed I suppose, but I'm not sure if that's really enough. As for the latter, we'll see. I'd also like to see INT8 blending improved a bit, since G8x is rather weak there compared to R6xx right now.
 
I'm not sure I really agree with that. The problem IMO is that people expect entire games to be ALU-limited, and that unless performance scales basically as fast as the number of GFlops, the ALUs are not the bottleneck. That's absolute nonsense!

It's really not hard to see how only half a frame or less might be ALU-bottlenecked. You've got Z/Stencil/Shadow passes, and some pixels requiring high levels of AF. There are also potential ROP and triangle setup bottlenecks even in color passes...
Agreed. But since so many different bottlenecks can limit performance even within a single frame, it'll be hard to demonstrate more than a barely measurable performance increase from improved MUL-ness.

In synthetic apps I totally agree with you - especially in CUDA, where most algorithms and applications are not yet set in stone and can be optimized to take advantage of that feature.

As for the INT8 blending rate: I haven't looked into that one yet. Mind sharing a few figures?
 
Agreed. But since so many different bottlenecks can limit performance even within a single frame, it'll be hard to demonstrate more than a barely measurable performance increase from improved MUL-ness.
Well, yeah - but I don't think it's very hard to see that, relative to G80, adding a very accessible "MUL or ADD" unit would increase effective ALU performance by ~50% when not doing attribute interpolation. Combine that with a 2GHz+ clock rate and you would get 2.5x+ more ALU performance. No matter how much of the frame that might affect, it certainly can't hurt!
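A minimal sketch of that arithmetic (the 1.35GHz baseline is G80's stock shader clock; the ~2.25GHz target and the "freely usable MUL or ADD" are just the assumptions in the post above, not confirmed specs):

```python
# Effective ALU throughput relative to G80, under the assumptions discussed above.
g80_clock        = 1.35   # GHz, G80 GTX shader clock
g80_flops_per_sp = 2      # MAD = 2 flops/clock; the extra MUL is mostly busy with interpolation

new_clock        = 2.25   # GHz, purely speculative clock target
new_flops_per_sp = 3      # MAD + a freely usable MUL-or-ADD = 3 flops/clock

issue_gain = new_flops_per_sp / g80_flops_per_sp   # 1.5x from the extra co-issued op
clock_gain = new_clock / g80_clock                 # ~1.67x from clock alone

print(f"combined gain per SP: {issue_gain * clock_gain:.2f}x")   # ~2.5x
```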

One factor I am sure NVIDIA thinks about (but nobody on public forums seems to!) is how various design decisions affect their average selling prices. Think of it this way: your board ASP for a given market segment (and thus, volume!) is constant. But your *chip's* ASP is not constant: if your PCB and memory are more expensive, your ASPs must be lower to hit the appropriate price segments.
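To make that concrete with toy numbers (every dollar figure below is invented purely for illustration):

```python
# Toy illustration of the board-ASP vs chip-ASP argument above; all prices are made up.
BOARD_ASP = 249.0   # what a given market segment will pay for the finished card

def chip_asp(pcb_cost, memory_cost, other_bom):
    """Whatever is left of the fixed board price for the GPU itself after the rest of the BOM."""
    return BOARD_ASP - pcb_cost - memory_cost - other_bom

# A cheaper PCB and cheaper memory leave more of the fixed board price for the chip.
print(chip_asp(pcb_cost=35.0, memory_cost=60.0, other_bom=70.0))   # 84.0
print(chip_asp(pcb_cost=50.0, memory_cost=90.0, other_bom=70.0))   # 39.0
```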

As for the INT8 blending rate: I haven't looked into that one yet. Mind sharing a few figures?
INT8, FP10 and FP16 blending all work at half speed on G8x, so 24 ROPs can only blend 12 pixels per clock there. FP32 is quarter speed. R6xx, on the other hand, does full-speed INT8/FP10/FP16 blending and half-speed FP32 blending.

Of course, nearly all R6xx parts have fewer ROPs than the competing G8x parts. G92 vs RV670 and G98 vs RV620 look set to change that if the rumours are accurate - so this might suddenly become a fair bit more important, I suspect...
 
Xbit Labs seem to think that G92 will be a mainstream performance part, rather than ultra high end.

http://www.xbitlabs.com/news/video/display/20070902154658.html

Unfortunately, there is no reliable information whether Nvidia’s G92 is actually a complex high-end graphics processor, or a chip of modest complexity to replace Nvidia GeForce 8800 GTS and power a dual-GPU “GX2” graphics card.

;) Perhaps a bit of Inq-style speculation on X-bit's part?
 
Nvidia is so much better these days at keeping information leaks at bay... damn! :LOL:
 
Yeah... the last few releases (NV40 and G80) have been totally unexpected, so ATM I think there's more going on behind the scenes than we know with this DX10.1 release. :)

Of course there's no reason why there won't be a GTS version of the G92, since the G80 had that part. I think the high-end part should be pretty interesting indeed. :)

Of course... Nvidia will again have a half-year advantage before the R700 gets released, so they really won't need to release the GTX version until then. The RV670 shouldn't be better than the G92 GTS, and it's only being released next year too.

US
 
One factor I am sure NVIDIA thinks about (but nobody on public forums seems to!) is how various design decisions affect their average selling prices. Think of it this way: your board ASP for a given market segment (and thus, volume!) is constant. But your *chip's* ASP is not constant: if your PCB and memory are more expensive, your ASPs must be lower to hit the appropriate price segments.
But AFAICS higher arithmetic throughput doesn't affect memory bandwidth requirements, because the data stays mostly on chip, even for loops. And making that interpolator more powerful and flexible will definitely cost transistors. Or are you referring to the option of adding more ALUs? Then I'd have to agree.

But OTOH you could just push the overall ALU clocks a little higher with simpler units and stay within a given thermal budget at the same time.

In my opinion, one of the strengths of the G8x architecture seems to be its very high utilization, which should be close to the possible maximum aside from the overhead of the 16-wide SIMDs (if we take ATI's claim of 95% utilization on R520 for granted).


Thanks for the heads-up on blending :)
But doesn't a higher blend rate also consume more memory bandwidth? That would (if you want to utilize it to the max) drive component costs up, so that you end up with more transistors in your chip and have to sell it for less?
 
Nvidia would be stupid not to release a high-end part near the launch of Crysis. There will be tonnes of guys itching for a new card to power that beast of a game. I know I'm personally ready to drop some cash on a new high-end card that will run that game with all the bells and whistles.

Easy money.
 
But doesn't a higher blend rate also consume more memory bandwidth? That would (if you want to utilize it to the max) drive component costs up, so that you end up with more transistors in your chip and have to sell it for less?
That assumes performance doesn't move, and that there's no ability to adjust anything else. What if those transistors let you significantly improve perf/area for that part of the chip, and you could reduce your area budget elsewhere without a significant perf/area decrease? You'd want to do that, right? It's all about the balance an IHV strikes for any given shipping design.

As for utilisation, someone should profile shipping games (DX9, DX10, OpenGL, doesn't matter) on unified hardware and tell us what average chip (and sub area) utilisations are for some good frames. I'd put money on the result being a bit less rosy than people assume.
 
But AFAICS higher arithmetic throughput doesn't affect memory bandwidth requirements, because the data stays mostly on chip, even for loops. And making that interpolator more powerful and flexible will definitely cost transistors. Or are you referring to the option of adding more ALUs? Then I'd have to agree.
I'm talking about increasing the number of usable GFlops any way whatsoever - it could be by increasing the number of ALUs, or by making each multiprocessor more efficient or powerful.

The net result of that is higher performance with bandwidth requirements that remain ~constant. Because of the higher performance, you can sell your chip for more money, but your non-chip (PCB, memory, etc.) costs are roughly constant. Well, that's not completely true: power requirements might increase, which might make the PCB and the cooling more expensive. But compared to the alternatives, that's really not such a big deal I suspect.

Interestingly, ASPs are also one of the main reasons why MCP78/MCP79 will be much more powerful, relative to discrete parts, than previous IGPs. NVIDIA would prefer an MCP78/MCP79 design win over a G86/G98 design win, because the ASPs they get there are higher! The PCB/memory costs are basically not a factor and there's no extra memory required, while you can ask for the combined price of a chipset and a GPU...

But OTOH you could just push the overall ALU clocks a little higher with simpler units and stay within a given thermal budget at the same time.
Indeed. Nothing prevents you from doing both though, obviously!

In my opinion, one of the strengths of the G8x architecture seems to be its very high utilization, which should be close to the possible maximum aside from the overhead of the 16-wide SIMDs (if we take ATI's claim of 95% utilization on R520 for granted).
Sure, that's true. Being efficient doesn't mean you can't also be powerful, though. And regarding the MUL, its 'efficiency' in G80 is obviously not that good: most of the time it's either used for interpolation or it's idling. Errr!

Thanks for the heads-up on blending :)
But doesn't a higher blend rate also consume more memory bandwidth? That would (if you want to utilize it to the max) drive component costs up, so that you end up with more transistors in your chip and have to sell it for less?
Yes, of course - but you don't *have* to double bandwidth to make use of that. In chips that have high-end memory, it could be a bottleneck - in ones using DDR2 or low-end GDDR3, obviously not. Here are the necessary calculations to show that:
8400 GS: 450MHz * 4 ROPs * 4 bytes * 2 * 0.5 = ~7.2GB/s
8600 GTS: 675MHz * 8 ROPs * 4 bytes * 2 * 0.5 = ~21.6GB/s

The 8400 GS has 6.4GB/s of available bandwidth, while the 8600 GTS has 32GB/s available. Clearly, in the former case it should never be a bottleneck, while in the latter it could sometimes be (although not a major one). I am not taking depth into consideration here, but there is a very good reason for that: particles don't write depth, they only read it - and that can be fairly cheap on average, especially with Hier-Z helping. That doesn't mean it's free, but you get my point.
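The same figures as a small sketch (ROP counts, clocks and available bandwidths are the ones quoted above; the factor of 2 is one read plus one write per blended pixel, 0.5 is G8x's half-rate INT8 blending):

```python
# INT8 blend bandwidth demand vs available bandwidth, using the figures quoted above.
def blend_bandwidth_gbps(core_mhz, rops, bytes_per_pixel=4, rw_factor=2, rate=0.5):
    """Worst-case blend traffic in GB/s; rate=0.5 models G8x's half-speed INT8 blending."""
    return core_mhz * 1e6 * rops * bytes_per_pixel * rw_factor * rate / 1e9

for name, mhz, rops, available in [("8400 GS", 450, 4, 6.4), ("8600 GTS", 675, 8, 32.0)]:
    needed = blend_bandwidth_gbps(mhz, rops)
    print(f"{name}: needs ~{needed:.1f} GB/s at peak blend rate, has {available} GB/s available")
```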

Also, in both cases, doing FP16 at the same speed is pure overkill. You are never going to get the bandwidth you need for that on G8x, except with high-end GDDR4 and fairly low core clock speeds!

Anyway, because G98 and G96 likely will have higher clock speeds than their predecessors and likely won't use GDDR4, this doesn't seem like much of a problem for them. Same for MCP78/MCP79 for obvious reasons. G92, on the other hand, could possibly benefit from single-cycle INT8/FP10 blending if it does use GDDR4.

It wouldn't be a major bottleneck in any sense of the term, but it is an interesting 'minor change' to ponder upon - especially so since G70->G71 made the exact same change! This time around, it might make more sense to keep the datapath width identical (32b/clock, see FP16) but double the filtering power (since FP10 likely reuses the same transistors as FP16).
 
Why not use G92 in a mainstream product line, and have a G92X2 package - two dies in one package, like the Q6600 - available from the launch date in November?


Two of them would be cheaper to produce, although a G92X2 would face problems such as a lack of mature drivers until the driver-side performance boosts arrive next year.
 
Well, G71 had a GX2, and yet it wasn't slower than G70. However, it is also true that it wasn't much faster! The question, this time around, is "by how much" and "for how much"...
 
Well, G71 had a GX2, and yet it wasn't slower than G70. However, it is also true that it wasn't much faster! The question, this time around, is "by how much" and "for how much"...

So, let's wait and see which configuration Nvidia adopts: two PCBs in one card, or two chips in one package.
 
Another point to think about, in my opinion, is the manufacturing capacity at TSMC, which is supposed to be limited.
So a chip with a G71-like die size would be better for making more cards.

@Vincent:

Why two dies on one package? Did you see a patent about this?
It doesn't make sense for two GPUs unless you want to connect them with 100GB/s+ of bandwidth, which is not very easy. Intel has to do it this way because they want to place two dies on one socket.

But I also think enthusiasts will get their G92 GX2, because there is still the matter of the dual-GPU boards for Tesla, and ATI will go a similar way with RV670 (x2 = R670).
 
So, let's wait and see which configuration Nvidia adopts: two PCBs in one card, or two chips in one package.
I don't think we'll see two chips per package this gen. Two chips per board, who knows - AMD certainly seems to be going down that road with Gemini...

AnarchX: Well obviously fewer mm2 for a given price segment helps there. As do higher yields... Another factor to consider is that they can try to manage their SKU transitions with this in mind to prevent major shortages. If 65nm has less of a capacity problem than 80nm, then they'll try to switch ASAP, at least for channel sales. If it's the other way around, then they might try to keep 80nm products running even if they're lower margin, just to prevent shortages.

Also, TSMC can always add more capacity to 65nm if they know in advance that they'll need it in a specific period. On the other hand, adding much more 80/90nm capacity at this point wouldn't make much sense. I'm also not sure how much of the production lines are shared between 65nm and 55nm... I'd presume most, so capacity constraints would be shared and NVIDIA/AMD will still be competing for capacity next year.

And G92x2 has always been kind of obvious in my mind, from the very second I heard the codename... But I'm not sure when they will introduce that. As noted by mozmo above, they would be crazy not to introduce it ASAP for those willing to upgrade for Crysis. But we'll see if they manage to do that or not.
 
For those folks expecting another GX2: do you expect a 7950GX2 SLI-on-a-stick approach or something more elegant? Given the track record of the first iteration and the ensuing problems with driver and application support, I certainly hope it's the latter.
 
Ummm. It seems to me you did a little mix-and-match there. I'm unconvinced the physical implementation will have any significant overlap with the driver/application support.
 
Sure it does. There is a physical component to SLI that determines how the workload is distributed and how data is shared. If the physical implementation is more integrated, then driver and application support could be modified to suit as inadequacies in the previous approach are addressed.
 
If one gets to that point it's probably not SLI of any flavor, "on-a-stick" or otherwise.
 