NVIDIA G92 : Pre-review bits and pieces

Per B: there is a GT with a heatsink, I think from Gigabyte.

It's from Sparkle according to Ibanez here

http://www.sparkle.com.tw/product_detail.asp?id=57&sub_id=152

[Attached image: 8800GT_F copy.jpg]


I wonder how hot this thing runs with that cooler. Quiet as a mouse though. :smile:

And some additional detail from Fudo: http://www.fudzilla.com/index.php?option=com_content&task=view&id=3902&Itemid=34

But Gigabyte may well have another one too.
 
Passive cooling is nice, but I guess this "beast" still dissipates so much heat that the case fans just run harder instead, causing noise?!?

For me, something in between an 8600 GTS and an 8800 GT would be perfect, as it should run cooler but still provide enough performance for the games I play (Valve Source-based ones). Could the G98 be the chip that gives us this card?
 
Passive cooling is nice, but I guess this "beast" still dissipates so much heat that the case fans just run harder instead, causing noise?!?

It's easier to shift that heat quietly with a 120mm case fan than a ~80mm or smaller fan attached to the card itself.
 
In my tests, the G92's 1D MUL issue rate per SP is 33% lower than the G80's, but its 4D issue rate is the same as the G80's.
 
I thought that too at one point, so I can't blame you - my own tests, as well as others' (including 3DMark06's), are very clear however: triangle setup is 0.5/clk. The triangles getting rejected based on depth tests don't seem to be an exception either...

If anyone knows of a case where you can get up to 1 tri/clk outside of backface culling, please let me know. I certainly cannot seem to find one, however! As for how the different chips on the market compare (speculated for unannounced products; note that I haven't been able to test most of those with my own tool):

R600/RV670/RV630: 1 tri/clk.
RV610: 0.5 tri/clk (iirc).

G92/G80/G84: 0.5 tri/clk.
G86/G98: 0.25 tri/clk (iirc).
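
To put those rates in perspective, here's a quick back-of-envelope conversion to peak setup throughput. The core clocks below are the usual reference clocks (not something I've measured), so treat this as a rough sketch only:

# Peak triangle setup throughput = tris/clk x core clock.
# The clocks are assumed reference clocks, not measured values.
chips = {
    "G80 (8800 GTX, 575 MHz)":    (0.5, 575e6),
    "G92 (8800 GT, 600 MHz)":     (0.5, 600e6),
    "R600 (HD 2900 XT, 742 MHz)": (1.0, 742e6),
}
for name, (tris_per_clk, clock_hz) in chips.items():
    print(f"{name}: {tris_per_clk * clock_hz / 1e6:.0f} Mtris/s peak setup")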

The G80's setup engine is 1 triangle/cycle, not 0.5.
 
Since you're all so confident in that claim, feel free to tell me under what conditions this is true. Because I certainly can't duplicate it! (and if you can't talk about it privately, PM? :))

The only thing I haven't tested is power-of-two rendertargets (or, well, it could be an OpenGL driver bug but I somehow doubt that...) - I could see why it might make sense to accelerate that a bit further, but I don't even see how it could be any easier to compute. Hmm.
 
Feel free to post some evidence or at least a theory as to why...

First of all, my statement was from the gaming perspective, and the word "horrible" was used because of the PR effect.

The evidence is simple: gaming tests with shader-heavy games (where the limited texture filtering capacity doesn't handicap the R600). While without AA the 2900XT keeps up with, and is sometimes a bit ahead of, the 8800GTX, switch on 4xAA and there's suddenly a 10-20% gap in favour of the GTX. (I'll dig up a link or two if you need them; these plagued the web in May-June, but right now it takes a little longer to find them.)

As for the explanation, I've got a lightweight theory, here goes. I assume when the shader core does AA resolve, then all 64 superscalar units have to do that, no mixing in of PS/VS/GS code is possible. While I know the basics about how MSAA works, I'm not familiar with the exact operation - still, I assume the 16 ROPs are severely underfeeding the shader core, thereby not only "stealing" shader capacity, but also wasting a fair amount.
I also think that the R600 cannot use all of its shading power on pixel shading (which is still quite dominant in games) - I have a nasty feeling that the fifth sub-unit in each superscalar shader remains mostly idle as the shader codes of today are not parallelisation-friendly. If I'm right, then the practical PS capacity of the R600 is only 10% more than the G80 (let's forget about G80's extra MUL / SP / cycle for now) - and this is why the wasted shader capacity is so painful. Practically, the R600 walks away beaten from a number of battles it could have easily won...
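
For what it's worth, the ~10% figure does check out arithmetically under that assumption. A quick sketch, using the reference clocks (742 MHz for R600, 1.35 GHz G80 shader clock) and counting only 4 of the 5 sub-ALUs per R600 unit:

# Rough check of the "only ~10% more PS capacity" claim above, assuming the
# fifth sub-ALU in each of R600's 64 units sits idle and ignoring G80's extra MUL.
r600_lanes, r600_clk = 64 * 4, 742e6    # 256 usable MAD lanes at 742 MHz
g80_lanes,  g80_clk  = 128,    1350e6   # 128 scalar SPs at 1.35 GHz
ratio = (r600_lanes * r600_clk) / (g80_lanes * g80_clk)
print(f"R600 / G80 MAD issue ratio: {ratio:.2f}")   # ~1.10, i.e. ~10% more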
 
Can we start a new speculation thread about Nvidia's next high-end part, since posting in the G92 speculation thread is now closed? For example, I sorta doubt the G92 will be used for Nvidia's next-gen HIGH END part. I mean, it'd be like Nvidia back in '06 releasing the 8800 GTX based on the G71 core. Know what I mean? I think the next high end will actually be the G100, which will yield performance gains similar to those seen going from a 7900 GTX to an 8800 GTX.
 
If DailyTech is correct about the 8800 GTS 512/1024 MB, then that would make 5 different cards Nvidia has called the 8800 GTS (configs listed as SPs / texture address units / texture filter units / ROPs) :oops::

1) 8800 GTS 320MB/320bit 96/24/48/20 (G80)
2) 8800 GTS 640MB/320bit 96/24/48/20 (G80)
3) 8800 GTS 640MB/320bit 96+/??/??/20 (G80)
4) 8800 GTS 512MB/256bit 128/64/64/16 (G92)
5) 8800 GTS 1GB/256bit 128/64/64/16 (G92)

I am assuming Nvidia is allowing card #3 so they can offload old G80 cores. Card #3 has been mentioned as a 112/28/56/20 configuration, but that doesn't make much sense because they would still be practically impossible to sell vs the G92 GTS and even the GT. A 128/32/64/20 setup seems more reasonable, and the clocks on the SSC edition of EVGA's model #3 "GTS" are suspiciously similar to the GTX. But it only makes sense to cannibalize working GTX cores if you are abandoning that SKU...

EDIT: Indeed, this seems to be the case: http://www.nvidia.com/object/geforce_family.html
Nvidia Website said:
HD DVD / Blu-ray Video Playback on NVIDIA GeForce 8800 GTS 512MB and 8800 GT is classified as Excellent.
But don't forget about the "new" G80 version - OMGWTFBBQGTS: http://evga.com/products/pdf/640-P2-N829-A1.pdf

Which brings me to my next point kids: don't smoke crack...
 
The evidence is simple: gaming tests with shader-heavy games (where the limited texture filtering capacity doesn't handicap the R600). While without AA the 2900XT keeps up with, and is sometimes a bit ahead of, the 8800GTX, switch on 4xAA and there's suddenly a 10-20% gap in favour of the GTX. (I'll dig up a link or two if you need them; these plagued the web in May-June, but right now it takes a little longer to find them.)
When you're digging, bear in mind two things: first, broken performance in R600 drivers; and second, games that do Z-only passes will see a much lower zixel rate on R600 in comparison with G80 - and Z-only passes can be quite costly in terms of their percentage of frame rendering time (e.g. doing early-Z or rendering shadow maps).

You should include R580 in your comparisons, too, though beware the z-only capability on that GPU is lower when AA is off.

As for the explanation, I've got a lightweight theory, here goes. I assume when the shader core does AA resolve, then all 64 superscalar units have to do that, no mixing in of PS/VS/GS code is possible.
Yep:

http://forum.beyond3d.com/showpost.php?p=1021653&postcount=867

While I know the basics about how MSAA works, I'm not familiar with the exact operation - still, I assume the 16 ROPs are severely underfeeding the shader core, thereby not only "stealing" shader capacity, but also wasting a fair amount.
How would these ROPs not "underfeed" a ROP-based AA-resolve, if R600 didn't do shader resolve?

Anyway, it seems likely the ROPs are involved in "decompressing" an AA'd render target (because the compression information is managed by the ROPs), before the shader resolve can do anything. The resolve uses texture operations to read this render target, but AMD alludes to a "quick path". Perhaps this consists of the ROPs transmitting compression data to the ALU pipeline (via the memory cache?) - the instantaneous bandwidth consumed by AA resolve, whose samples are otherwise uncompressed, is phenomenal...

R600 can fetch 16 AA samples per clock through its texture hardware, which means 1/4 rate for the 64 ALU pipelines, or 4 cycles per AA sample per pixel which is 16 cycles per pixel. This assumes no pixel has compressed AA samples. Compression would speed this up dramatically (say 2x or 4x). Assuming the arithmetic on these samples is nothing more than sum(sample*0.25), which is 4 cycles, then the effective total shader duration is 16 cycles worst case, the time to fetch the samples. That's 724fps for a 2560x1600 render target (2.968G pixels/s).

As you say, the ALUs are bottlenecked by the fetches, running at 15% utilisation ( = 25% due to fetches x 60% due to RGB usage of 5 ALUs).

At 60fps 2560x1600, there are 193 ALU clocks per pixel (shared by vertex+pixel). 16 clocks of AA resolve amounts to 8.3%, worst case (no compression - in fact if there were no compressed pixels then the limited bandwidth available would roughly double the cost here).
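
For anyone who wants to reproduce those figures, here's the arithmetic as a small script, assuming R600's 742 MHz clock, 4xAA at 2560x1600, and the worst case of no AA compression:

# Worst-case shader AA resolve cost on R600 (no compression assumed).
clk        = 742e6            # R600 core/ALU clock
pixels     = 2560 * 1600      # render target size
samples_pp = 4                # 4xAA
fetch_rate = 16               # AA samples fetched per clock via the texture hardware

resolve_rate = clk * fetch_rate / samples_pp            # pixels resolved per second
print(f"{resolve_rate / 1e9:.3f} Gpix/s -> {resolve_rate / pixels:.1f} fps")

alu_util = (1 / samples_pp) * (3 / 5)                   # fetch-limited x RGB-only ALU use
print(f"ALU utilisation during resolve: {alu_util:.0%}")

alu_clocks_per_pixel = clk * 64 / (60 * pixels)         # ALU-pipe clocks per pixel at 60 fps
print(f"{alu_clocks_per_pixel:.0f} ALU clocks/pixel; resolve share: {16 / alu_clocks_per_pixel:.1%}")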

So, AA resolve on its own isn't very costly. If there are multiple 2048x2048 MSAA'd shadow maps being resolved per frame, then it starts adding up, hence my earlier comment.

I also think that the R600 cannot use all of its shading power on pixel shading (which is still quite dominant in games) - I have a nasty feeling that the fifth sub-unit in each superscalar shader remains mostly idle as the shader codes of today are not parallelisation-friendly. If I'm right, then the practical PS capacity of the R600 is only 10% more than the G80 (let's forget about G80's extra MUL / SP / cycle for now) - and this is why the wasted shader capacity is so painful. Practically, the R600 walks away beaten from a number of battles it could have easily won...
Games (on G80/R600) are rarely ALU-limited. It seems fillrate or bandwidth or texture rate always gets you first.

Also, G80's apparent "efficiency" is not all a gain, e.g. it uses its ALUs to interpolate attributes (texture coordinates) whereas R600 has dedicated hardware. It's good for ALU utilisation in G80, but it can easily hurt pixel throughput too. e.g. bilinear texturing in G84 and G92 appears to show a severe ALU bottleneck (when compared against theoretical).

Jawed
 
When you're digging, bear in mind two things: first, broken performance in R600 drivers; and second, games that do Z-only passes will see a much lower zixel rate on R600 ...

As far as I can see, nothing much is broken beyond the 8.37 drivers, only isolated cases, mostly with AA. Unfortunately I don't know which games are z-pixel heavy, but here's my selection: Oblivion, Prey, Half-Life 2 Lost Coast, and Rainbow Six: Vegas.

Here's a good test including the 1950XTX also - although it's with the 7.5 drivers, this much newer test shows that the 7.10 drivers didn't add much - the exception is Oblivion which did have serious issues with AA. Anyway, the best demonstration is Prey.

How would these ROPs not "underfeed" a ROP-based AA-resolve, if R600 didn't do shader resolve?

They wouldn't, they just wouldn't be stealing shader cycles. I disagree with what you say later about the G80 not being ALU limited - the test above demonstrates around 40% difference between the GTS and the GTX, which cannot really come from elsewhere.

More later, I have to be running now.
 
AFAIK, the Source engine employs a depth-only pass before each frame is rendered. That's probably the reason HL2 & Co. gain a little more on G80 hardware than on R600, despite not being a particularly demanding engine.
Soft-particle effects are also z/s rate limited.
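
For anyone unfamiliar with the technique being referred to, here's a minimal sketch of a depth-only pre-pass using PyOpenGL. draw_scene() is a hypothetical stand-in for the application's geometry submission, not anything taken from the Source engine itself:

from OpenGL.GL import (glColorMask, glDepthMask, glDepthFunc,
                       GL_FALSE, GL_TRUE, GL_LESS, GL_LEQUAL)

def render_frame(draw_scene):
    # Pass 1: depth only - no colour writes, so the pass is mostly z/stencil rate bound.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE)
    glDepthMask(GL_TRUE)
    glDepthFunc(GL_LESS)
    draw_scene()

    # Pass 2: full shading; early-Z now rejects any fragment that lost the depth test.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE)
    glDepthMask(GL_FALSE)
    glDepthFunc(GL_LEQUAL)
    draw_scene()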
 
As far as I can see, nothing much is broken beyond the 8.37 drivers, only isolated cases, mostly with AA.
Er... precisely. AMD says that stability of the drivers has been the priority, not performance.

Unfortunately I don't know which games are z-pixel heavy, but here's my selection: Oblivion, Prey, Half-Life 2 Lost Coast, and Rainbow Six: Vegas.

Here's a good test including the 1950XTX also - although it's with the 7.5 drivers,
Launch drivers, only useful in demonstrating that high profile games get their drivers tweaked while the rest are broken.

this much newer test shows that the 7.10 drivers didn't add much - the exception is Oblivion which did have serious issues with AA. Anyway, the best demonstration is Prey.
Prey always performs significantly better on HD2900XT than on X1950XTX, around 26% better with AA. According to theoreticals, Prey should be ~14% faster.
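
(For reference, one plausible reading of "theoreticals" here is the core-clock ratio: both chips have 16 ROPs and 16 texture filter units, so peak fillrate and texturing scale with clock. A hypothetical sketch, assuming the usual reference clocks:)

# Hypothetical source of the ~14% figure: both GPUs have 16 ROPs, so peak
# fillrate (and bilinear texturing) scales with core clock. Clocks are assumed
# reference values, not measured.
hd2900xt_clk, x1950xtx_clk = 742e6, 650e6
print(f"clock/fillrate advantage: {hd2900xt_clk / x1950xtx_clk - 1:.0%}")   # ~14%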

So, shader AA resolve is definitely not hurting in Prey (but then it shouldn't, it's not an ALU-heavy game as far as I can tell, and being "low-poly" there'll be a fair amount of AA compression to ease bandwidth).

They wouldn't, they just wouldn't be stealing shader cycles. I disagree with what you say later about the G80 not being ALU limited - the test above demonstrates around 40% difference between the GTS and the GTX, which cannot really come from elsewhere.
There's 50% theoretical ALU difference between them, 53% texturing, 38% fillrate and 35% bandwidth :p
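
Those percentages fall straight out of the published specs. A quick sketch, assuming the reference figures for the 8800 GTX and 8800 GTS 640 (shader clocks taken as 1.35 GHz and 1.2 GHz here):

# Theoretical GTX-over-GTS advantages from the reference specs (assumed values):
# SPs @ shader clock, texture filter units @ core clock, ROPs @ core clock,
# memory bus width x effective memory clock.
gtx = dict(sp=128, sclk=1350, tf=64, cclk=575, rop=24, bus=384, mclk=1800)
gts = dict(sp=96,  sclk=1200, tf=48, cclk=500, rop=20, bus=320, mclk=1600)

def adv(a, b):
    return f"+{a / b - 1:.0%}"

print("ALU:      ", adv(gtx['sp']  * gtx['sclk'], gts['sp']  * gts['sclk']))  # +50%
print("Texturing:", adv(gtx['tf']  * gtx['cclk'], gts['tf']  * gts['cclk']))  # +53%
print("Fillrate: ", adv(gtx['rop'] * gtx['cclk'], gts['rop'] * gts['cclk']))  # +38%
print("Bandwidth:", adv(gtx['bus'] * gtx['mclk'], gts['bus'] * gts['mclk']))  # +35%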

As I said earlier, NVidia has attribute interpolation "stealing cycles".

ATI's rationale for shader AA resolve is sane: any game that does HDR (or deferred rendering) looks better with developer programmed AA - fixed function AA either looks wrong or doesn't work.

Jawed
 
Er... precisely. AMD says that stability of the drivers has been the priority, not performance. (...)
Launch drivers, only useful in demonstrating that high profile games get their drivers tweaked while the rest are broken.

AFAIR, 8.37 was the first so-called performance driver. But even if it only improved performance for the most popular games, it still fits our needs :smile:

Prey always performs significantly better on HD2900XT than on X1950XTX, around 26% better with AA. According to theoreticals, Prey should be ~14% faster.

I see your point about the XTX, although I was comparing to the 8800GTX all along. I ran through most of the tests, and sure enough, with 4xAA/16xAF at higher resolutions the difference between the two Radeons is more often than not around the difference in clock speeds, while the 8800 cards perform significantly better (with a much bigger margin than the difference in ROP capacity). As I assume the AA fetches don't go through the texture filters, only one explanation remains - can all that extra difference really be down to AF performance? Ouch.

There's 50% theoretical ALU difference between them, 53% texturing, 38% fillrate and 35% bandwidth :p

Ummmm. Right. :oops: Maybe I should give up calculating everything in my head...
If games are mostly limited elsewhere, not by ALU capacity, then I need to rethink my approach a little...

ATI's rationale for shader AA resolve is sane: any game that does HDR (or deferred rendering) looks better with developer programmed AA - fixed function AA either looks wrong or doesn't work.

And in any case, DX10 requires it. I was simply against omitting the hardware support for "traditional" MSAA.

About the second part of your previous post: if ALU capacity is not the limiting factor, then the 8.3% difference really doesn't amount to much. The bandwidth idea is interesting though - if that's the reason for the problematic AA performance, then the RV670 can solve it in an elegant manner instead of reinstating hardware support for the "traditional" resolve.

Hmmm. You gave me a few things to mull over. Thx! :smile:
 
So Jawed, how much longer until we can drop the "it's all drivers" and "the focus is on stability" explanations for R600 performance abnormalities? 2009? 2010? I mean, it's not like it's reached the end of its f-ing life or anything at this point.
 
Er... precisely. AMD says that stability of the drivers has been the priority, not performance.
...........
Jawed
OK, just tell us why, although R600 has more bandwidth and more ALU power than G80, it's beaten again and again.
To me that suggests a bad balance of priorities. Do you think it's something else, and where/how do you think it can be improved?
 