8800GTX Shadermark results

So you think someone cannot see what the bottleneck in a single system is while running some tests? You don't need to compare a gazillion systems to see if a system is CPU bound; one indication would be that at resolution X you suddenly get GPU-specific functions nearly for free, and the next best riddle is what a four-legged animal walking on a hot tin roof might be ;)

In games, which is what I was talking about, increased CPU power can certainly still provide benefits at higher resolutions. Therefore the idea that you'll need a good CPU to get the most out of the newer cards is perfectly valid.
 
The vast majority of 3D-intensive games would see very little to no benefit from a faster CPU at high resolution, provided you are not running them on a Pentium 3 in the first place.
 
The vast majority of 3D-intensive games would see very little to no benefit from a faster CPU at high resolution, provided you are not running them on a Pentium 3 in the first place.


It is there. It's just not huge. This will be more relevant to SLI/Crossfire setups too, as they have additional CPU overhead.

Chris
 
SLI/Crossfire are indeed a different case, since they are also less likely to be GPU-bound. Otherwise, if HardOCP's horrendous Core 2 review has taught us anything (as if we didn't know it already), it's that there is pretty much no CPU scaling in graphics-bound scenarios.
 
In games, which is what I was talking about, increased CPU power can certainly still provide benefits at higher resolutions. Therefore the idea that you'll need a good CPU to get the most out of the newer cards is perfectly valid.

All you need is an extremely high resolution and a healthy amount of AA samples. Try something like 2048*1536 with 8xAA yourself and we'll see if and how much an even stronger CPU makes a difference (given, as noted, that you're not running it with a Celeron in the first place, heh).

My point still stands; it's more than easy to find a single system's bottlenecks. From there you can define in which cases you're GPU limited and in which you're CPU limited. Past 1600*1200 with AA enabled, chances are minuscule that you'll see more than half a frame of difference, even though some may still insist that a single-digit percentage is still a "difference".
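To put that in pseudo-benchmark terms, here's a minimal sketch of the single-system test I mean (the frame rates and the 5% cutoff are made up for illustration): rerun the same timedemo at a low and a high resolution; if the frame rate barely moves when the GPU load goes up, you're CPU limited.

Code:
# Hypothetical single-system bottleneck check. Raising the resolution
# adds GPU work but leaves CPU work per frame roughly unchanged.
def bottleneck(fps_low_res, fps_high_res, tolerance=0.05):
    """Label the limiting component from two runs of the same timedemo."""
    drop = (fps_low_res - fps_high_res) / fps_low_res
    return "CPU-limited" if drop < tolerance else "GPU-limited"

# Made-up numbers, purely illustrative:
print(bottleneck(142.0, 139.0))  # ~2% drop  -> "CPU-limited"
print(bottleneck(142.0, 88.0))   # ~38% drop -> "GPU-limited"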
 
What is your CPU? How do you know, unless of course you've done a wide range of testing, that you would not gain from the extra CPU power?

You're missing my point, which is that it is entirely self-evident (unless your name is Fuad), and thus redundant, that, all other things being equal, a faster GPU will increase the likelihood of an application becoming CPU limited.

I just find it strange that this is suddenly being 'discovered' as something particularly special to the 8800GT(X|S).

Especially for a GTS which is allegedly about as fast as a 7950GX2.

(Unless, of course, there is a fundamental architectural difference in the driver that increases the CPU load for an 8800GTS. Now that would be a much more interesting thing to look at, but without context we can't know.)
 
We get claims of "don't use this video card unless you have CPU X" all the time. It's basically nonsense: a better CPU gets you higher framerates, while a better GPU nets you higher resolutions and better graphics quality settings at the same framerates. These rules aren't exact, of course, but they're reasonably good rules of thumb.
 
Arithmetic and texturing perf look to be over 2x that of G71 (128 scalar MAD+MUL units at 1350 MHz would explain it), but I'm not seeing enough differences in shading characteristics from G71 to suggest a very different base architecture. Texture decoupling, if it did happen, doesn't seem to have helped one Shadermark test over another (relatively speaking), at least not to the degree of R420->R520. For similar reasons I'm doubtful about truly independent scalar ALUs.
Well, if both the 8800GTX and GTS have the same number of texture units (64) despite having different numbers of shader units, as the rumored theoretical texel fillrates would suggest, that would indicate that they are decoupled.

However, there's something going on with these 128 shaders that I'm obviously not getting. Say NVIDIA kept the per-unit efficiency of the G7x line, which they'd have no reason not to with such a large number of units. Then the G80 should have several times the raw shader performance of the G71, not just a little over double.

Wild speculation here: if the texture units aren't decoupled, and the rumored texture rates on the GTS are wrong, then say the G80 architecture has a consistent 2:1 shader-to-texture-unit ratio, with the GTS having 48 texture units instead of 64. What if NVIDIA took a wild marketing turn and counted the dual math units as though they doubled the shader count, so that instead of 128 you really have 64? That would make the texture ratio more like 1:1, as in all the previous architectures. And due to being unified, not only would texturing hold back pixel shading performance, you'd also have vertex instructions to contend with. That would make the ~2x pixel shader performance of the G71 more reasonable.
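Just to spell out the unit counts that speculation implies (every number here is a rumour or an assumption from the paragraph above, not a confirmed spec):

Code:
# Rumoured "marketing" shader counts vs texture unit counts.
gtx_shaders_claimed, gtx_tmus = 128, 64
gts_shaders_claimed, gts_tmus = 96, 48   # 48 TMUs is the speculated figure

# As claimed: a consistent 2:1 shader-to-texture ratio on both cards.
print(gtx_shaders_claimed / gtx_tmus, gts_shaders_claimed / gts_tmus)  # 2.0 2.0

# If the dual math units are being double-counted, halve the shader
# counts and you're back at the familiar 1:1 of earlier architectures.
print(gtx_shaders_claimed // 2 / gtx_tmus, gts_shaders_claimed // 2 / gts_tmus)  # 1.0 1.0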

Not that any of this really matters, as it's insanely fast. The VS also looks speedy for a non-unified design. IMO, R600 will have a tough time matching this unless ATI also made GHz shader engines. Hopefully AMD can help them out in this area for next-next gen.
As many have pointed out, I do think the results point to a unified architecture.
 
However, there's something going on with these 128 shaders that I'm obviously not getting. Say NVIDIA kept the per-unit efficiency of the G7x line, which they'd have no reason not to with such a large number of units. Then the G80 should have several times the raw shader performance of the G71, not just a little over double.
Why, if it has less than double the shader FLOPS?
 
Reputator, those are pretty good points.

For the decoupling, I wasn't so much referring to physical layout as to the dependencies within the pipeline. In G7x the arithmetic units are used to (partially?) calculate the texture address, and if a texture load is stalled due to bandwidth or other reasons, the pipeline will stall irrespective of non-dependent math ops that could be done. At least that's the picture painted by GPUBench.
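As a toy model of that dependency (made-up cycle counts, nothing G7x-specific): with coupled units, math that doesn't depend on the texture result still waits out the fetch, while decoupled units can overlap the two.

Code:
# Illustrative cycle counts for one shader loop containing a stalled
# texture fetch plus some independent (non-dependent) math.
tex_stall_cycles = 20   # made-up stall on the texture load
indep_alu_cycles = 12   # made-up independent ALU work available

coupled   = tex_stall_cycles + indep_alu_cycles       # ALUs sit idle: 32
decoupled = max(tex_stall_cycles, indep_alu_cycles)   # math hides the stall: 20

print(coupled, decoupled)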

You could still have this type of dependency with strange TEX:ALU ratios for the GTS if the scheduler was up to it, and from the similarity between G80 and G71 in results across the shader tests (aside from that beefy 2-2.5x scale factor :D), I'm guessing that's indeed the case. More tests are needed, though, as it's not very solid proof.

Nonetheless, I definitely think that the TEX:ALU ratio is approximately the same as G71's; otherwise we'd definitely see at least some differences from test to test in Shadermark. Just look at how varied the per-test improvements were from R520 to R580.
(EDIT: Wait, I made a mistake in looking at the Archmark numbers. The bilinear numbers aren't much higher on G80. But the texture laden tests in Shadermark double with G80. Hmmm...)

I considered 64 dual issue vector shader pipes as well, but if they were running at 1350 MHz (which I admit isn't confirmed yet), G80 would be around 5x the speed of G71. But at 575 MHz it makes perfect sense. I'm skeptical about the fully unified architecture (i.e. VS/PS). It doesn't seem to mesh with their general philosophy. Another possibility is that there are 32 MAD+MUL vector shader pipes for the PS, 24 in the VS, and 8 in the GS. If all ran at 1350 MHz, you'd get similar performance to what we're seeing, and it'd be 128 total vector ALUs.

We'll see soon enough...
 
I put the X1950, X1950CF and 8800GTX results into a pretty graph for those who like graphs and stuff..:D



Very odd that some of the CF scores are lower than the single card scores though..:???:

Apologies for being such a nit, but you probably picked the chart type which makes comparisons between the cards the most difficult... :p Any chance of a normal bar-graph where each card's bars are independent, not combined?
 
Reputator, those are pretty good points.

Actually, I think the shader benchmarks shown so far show that efficiency has improved. 128 scalar ALUs @ 1350MHz =~ 172GFlops. G71 48x4 @ 675MHz = 129GFlops. So a G80 only has about 1.33x more raw ALU power. If you assume an efficiency increase from, say, 60% utilization to 90% utilization, you end up with 1.99x.
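Spelled out (this counts one MAD issue per ALU per clock as a single unit of work, and the 60%/90% utilization figures are assumptions, not measurements):

Code:
# One op per ALU per clock as the counting convention.
g80_raw = 128 * 1.35        # 128 scalar ALUs @ 1.35 GHz -> ~172.8
g71_raw = 48 * 4 * 0.675    # 48 vec4 ALUs   @ 675 MHz  -> ~129.6

print(g80_raw / g71_raw)                    # ~1.33x raw
print((g80_raw * 0.90) / (g71_raw * 0.60))  # ~2.0x with 60% -> 90% utilization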


Nonetheless, I definitely think that the TEX:ALU ratio is approximately the same as G71's; otherwise we'd definitely see at least some differences from test to test in Shadermark. Just look at how varied the per-test improvements were from R520 to R580.

What are you calling an "ALU" and a "TEX" unit in the G80? Since the ALUs are scalar, the comparison doesn't make as much sense, unless you want to say "TEX:ALU/4 ratio". The benchmark evidence points at least to a doubling of bilerp resources, because trilinear is "free".
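The bilerp accounting behind that, for reference (standard filtering arithmetic, with the rumoured 64-unit count plugged in): a trilinear sample is two bilinear samples from adjacent mip levels plus one extra lerp, so full-rate trilinear implies each unit can retire two bilerps per clock.

Code:
# Filtering cost measured in bilinear lookups ("bilerps"):
bilerps_per_bilinear  = 1   # 4 texels, 3 lerps
bilerps_per_trilinear = 2   # 2 bilinear samples + 1 lerp between mips

# "Free" trilinear = trilinear at the same rate as bilinear, which
# requires 2x bilerp throughput per texture unit per clock.
tmus = 64                             # rumoured G80 texture unit count
print(tmus * bilerps_per_trilinear)   # 128 bilerps/clock implied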

I think there have been enough independent leaks and confirmations, including people picking up actual G80 cards from clueless retail stores, that the 128 scalar unit spec is probably the safe bet. Which means the G80 has significant improvements in efficiency and is not a brute-force chip (save for the memory bus). I also think it's probably fully unified, with no separation of VS/GS from PS, and that Kirk's constant denials were simply misdirection, downplaying a competitor's architecture even as they were preparing to launch their own implementation.
 
Actually, I think the shader benchmarks shown so far show that efficiency has improved. 128 scalar ALUs @ 1350MHz =~ 172GFlops. G71 48x4 @ 675MHz = 129GFlops. So a G80 only has about 1.33x more raw ALU power. If you assume an efficiency increase from, say, 60% utilization to 90% utilization, you end up with 1.99x.

G71 = 24 (double pumped ALUs) * 16 FLOPs * 0.65GHz = 249.6GFLOPs

and if I am to trust the piled up rumours about G80:

128 * 3 FLOPs * 1.35GHz = 518.4 GFLOPs

Only the VS FLOPs are missing from the G71 math above; add those to make it more accurate.
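Or, with the bookkeeping written out (MADD = 2 FLOPs, MUL = 1 FLOP; the G80 figures are those piled-up rumours):

Code:
# G71 pixel shaders: 24 pipes * 2 vec4 MADD ALUs * 4 components * 2 FLOPs
#                  = 16 FLOPs per pipe per clock.
g71 = 24 * 16 * 0.65     # 249.6 GFLOPs (VS not counted)

# Rumoured G80: 128 scalar ALUs, each MADD (2 FLOPs) + MUL (1 FLOP) = 3.
g80 = 128 * 3 * 1.35     # 518.4 GFLOPs

print(g80 / g71)         # ~2.08x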
 
Reputator, those are pretty good points.

For the decoupling, I wasn't so much referring to physical layout as to the dependencies within the pipeline. In G7x the arithmetic units are used to (partially?) calculate the texture address, and if a texture load is stalled due to bandwidth or other reasons, the pipeline will stall irrespective of non-dependent math ops that could be done. At least that's the picture painted by GPUBench.

You could still have this type of dependency with strange TEX:ALU ratios for the GTS if the scheduler was up to it, and from the similarity between G80 and G71 in results across the shader tests (aside from that beefy 2-2.5x scale factor :D), I'm guessing that's indeed the case. More tests are needed, though, as it's not very solid proof.

Nonetheless, I definitely think that the TEX:ALU ratio is approximately the same as G71's; otherwise we'd definitely see at least some differences from test to test in Shadermark. Just look at how varied the per-test improvements were from R520 to R580.
(EDIT: Wait, I made a mistake in looking at the Archmark numbers. The bilinear numbers aren't much higher on G80. But the texture laden tests in Shadermark double with G80. Hmmm...)

I considered 64 dual issue vector shader pipes as well, but if they were running at 1350 MHz (which I admit isn't confirmed yet), G80 would be around 5x the speed of G71. But at 575 MHz it makes perfect sense. I'm skeptical about the fully unified architecture (i.e. VS/PS). It doesn't seem to mesh with their general philosophy. Another possibility is that there are 32 MAD+MUL vector shader pipes for the PS, 24 in the VS, and 8 in the GS. If all ran at 1350 MHz, you'd get similar performance to what we're seeing, and it'd be 128 total vector ALUs.

We'll see soon enough...
Well, 32 pixel shader pipes would also make more sense of the shader performance, you're right. A separate (non-unified) design would also make the number of texture units more understandable, if you've still got conjoined ALU/TMU units in there reminiscent of the GF6 and GF7.

If they are unified and have conjoined ALUs, then either the shader units aren't 100% uniform (some have texture capabilities, some don't), or they are uniform and some of the units' texture capabilities are locked off. Either way you end up with a definable number of TMUs that are still integrated into the shaders, while the design remains unified.

But that sounds unlikely to me, as it would be harder to do. I can't argue against the characteristics of the performance we're seeing here, since I don't have the analysis skills, but take into account that the texturing capabilities are greatly enhanced, not just in performance but in abilities: improved efficiency, multiple modes of AF. Clearly the texture units have been beefed up. With as many shader units as there are, you'd run into further yield complications (I should think) if the math units also had to carry full, really robust texturing routines. From that perspective it would make more sense to decouple them from the shaders, so you can manipulate them as you please.

About the performance, it's just a thought, but the drivers used here may not be able to take full advantage of decoupled texture units yet; they're still largely tailored to the GF6/GF7 architecture.
 
G71 = 24 (double pumped ALUs) * 16 FLOPs * 0.65GHz = 249.6GFLOPs

and if I am to trust the piled up rumours about G80:

128 * 3 FLOPs * 1.35GHz = 518.4 GFLOPs

Only the VS FLOPs are missing from the G71 math above; add those to make it more accurate.

A nit: the G71 isn't double pumped (2x clock rate), it's two separate ALUs. But in any case, how are you attributing 16 FLOPs per ALU? If you count MADDs as 2 FLOPs, I suppose you could say it's 8 FLOPs per ALU, without counting NRM_16. And how do you come up with 3 FLOPs per scalar ALU on G80? If it has a scalar MADD, you can count 2 FLOPs; where does the other come from? Are you counting special functions in the interpolators?
 
A nit: the G71 isn't double pumped (2x clock rate), it's two separate ALUs. But in any case, how are you attributing 16 FLOPs per ALU? If you count MADDs as 2 FLOPs, I suppose you could say it's 8 FLOPs per ALU, without counting NRM_16. And how do you come up with 3 FLOPs per scalar ALU on G80? If it has a scalar MADD, you can count 2 FLOPs; where does the other come from? Are you counting special functions in the interpolators?

He is counting the presumed/speculated MADD and MUL units in each G80 ALU.

MADD = 2 FLOPs
MUL = 1 FLOP.

Also, on G70 he is thinking:

24 MADD x 16 FLOPs = 48 MADD x 8 FLOPs.
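In other words, the two slicings give the same per-clock total:

Code:
# 24 pipes * 16 FLOPs (2 vec4 MADD ALUs * 4 comps * 2 FLOPs) per pipe,
# or 48 ALUs * 8 FLOPs (4 comps * 2 FLOPs) each -- same total.
assert 24 * 16 == 48 * 8 == 384   # FLOPs per clock
print(384 * 0.65)                  # 249.6 GFLOPs at 650 MHz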
 
I thought the G80 was MADD + MUL? I don't recall where we got that piece of "information", though.

As for G71, wouldn't it be (roughly, as you note) Vec4 MADD * 2/pipe * 24 pipes? 4 * 2 * 2 * 24.....
 
Nvidia like an extra MUL (both NV40 and G80) and ATI like an extra ADD (R4X0, R5X0 and, interestingly, R500/C1's extra scalar ADD + SF).
 
Actually, I think the shader benchmarks shown so far show that efficiency has improved. 128 scalar ALUs @ 1350MHz =~ 172GFlops. G71 48x4 @ 675MHz = 129GFlops. So a G80 only has about 1.33x more raw ALU power. If you assume an efficiency increase from, say, 60% utilization to 90% utilization, you end up with 1.99x.
It all depends on the assumptions. My best guess was one of two things:

1) 128 scalar stream processors (MAD+MUL) either in the PS alone or unified between VS/PS. No particular reason for choosing MAD+MUL, I just thought it might be easier to hit 1350 MHz, and dual-MAD didn't seem to help G7x too much in shader tests over NV4x.

2) 128 vector ALUs (1 MAD each) split among the PS, VS, and GS as 32, 24, and 8 shader pipes respectively, each of which is MAD-MAD capable like in G7x.
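For what it's worth, here's what those two guesses imply on paper (all unit counts and the 1.35 GHz clock are the speculated figures from this thread, counting MAD as 2 FLOPs and MUL as 1):

Code:
GHZ = 1.35

# Guess 1: 128 scalar stream processors, each MAD + MUL (3 FLOPs/clock).
guess1 = 128 * 3 * GHZ                 # 518.4 GFLOPs

# Guess 2: 64 pipes (32 PS + 24 VS + 8 GS), each with two vec4 MAD ALUs
# like G7x -> 128 vector ALUs * 4 components * 2 FLOPs.
guess2_total = 128 * 4 * 2 * GHZ       # 1382.4 GFLOPs across PS+VS+GS
guess2_ps    = 32 * 2 * 4 * 2 * GHZ    # 691.2 GFLOPs in the PS alone

print(guess1, guess2_total, guess2_ps)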


In any case, my remark about efficiency from going the scalar route or decoupling the texture units was not based on the average gain, but on the expectation that there should be more variability among the specific gains for each test. The gains looked a bit too uniform to me. Not perfectly so, but nothing compared to R580 vs R520. This is decent evidence against a notably changed TEX:ALU ratio in G80 from G7x, but I was also going out on a limb by saying it's likely evidence that scalar processing didn't get a big relative boost compared to vector.

But looking a bit closer now I think I was unjustified in suggesting that second claim, as there are indeed a couple of variations among the tests. Better data would be nice. Why hasn't anyone with a 7900GTX given us some 1600x1200 data? I'm trying to correlate with other benchmarks on the web, but the R580 scores aren't lining up perfectly with BlizzardOne's. Different Shadermark builds maybe?

Back to the texturing, my point was that wherever R580 struggles to improve over R520, G80 is usually 2-2.5x G71 (except the HDR tests), just like in all the other tests. I really doubt that free trilinear is responsible for this at all, since at 1600x1200 Shadermark is probably only using the top-level mipmap for most pixels. Could it be free FP16 filtering, free volume fetch, free PCF, free env map address calc, free FP texture read? That theory seems a little too ad-hoc to me, and still doesn't explain everything.

To me it looks like G80 has around twice the texturing speed of G71 to ~match the computational speed increase. The Archmark results contradict that, but I find free trilinear to be quite odd in the first place. I'm really not sure what's going on.

(BTW, when comparing TEX:ALU ratios, I'm talking about net output.)
 
Well, I think you are correct that the ratios haven't gone down; if anything, I bet more TEX power is available per ALU. Likewise, ROP power has gone up, at least effective ROP power (MSAA, HDR, memory bandwidth). Whether you view this as a good thing depends on your assumptions about what developers will do with TEX:ALU. I think the R580, Xenos, and even RSX/G71 already have massive amounts of ALU power that has gone unused, in part because figuring out what to do with it while keeping efficiency up is hard. Using more texturing, fillrate, and bandwidth is much easier to figure out from a developer's perspective.
 