Apologies for a long post with lots of numbers coming up ahead...
Jawed said: Truly intense arithmetic tests seem to be all over the shop:
http://www.digit-life.com/articles2/video/3dmark06/3dmark06_11.html
which shows a 35% advantage per fragment pipe for G7x.
I must admit to having no idea how you are deriving that result. Are you comparing A:B performance of a single ALU?
Taking the results for the 1024x768 test case shown, and assuming that in such an ALU-limited test performance scales near-linearly with engine clock rate, I make the X1900's performance scaled to 550 MHz something like 97.6 fps, compared to 66.4 fps for G70. Scaling by 24/16 for the number of fragment pipes, a 24-pipe R580 would theoretically run at 146.4 fps, or 2.2 times the performance per shader pipe of G70.
Comparing X1800 against the 7800GTX and scaling similarly by clock rate, I make the X1800's performance 40.9 fps at 550 MHz. Scaling by 24/16 for the number of pipes gives 61.35 fps, so at equal clock rates and pipe counts the G70 fragment pipeline appears to perform about 8% better per clock than an X1800 on this test. Given that G70 supposedly has an entire additional MAD unit, and that this test is very heavy on ALU instructions, that doesn't sound like a huge delta to me.
In these numbers I am ignoring any potential performance gains for the 7800 from its higher memory clock - the effects in a heavily ALU limited test are probably small.
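For anyone wanting to check the arithmetic, here is a rough Python sketch of the pipe-count step above. It assumes, as I did, that performance scales linearly with fragment pipe count, and it starts from the clock-scaled figures (97.6 and 40.9 fps at 550 MHz) rather than the raw results. The variable and function names are just my own labels.

```python
# Rough per-pipe comparison for the ALU-limited 3DMark06 test at 1024x768.
# Assumption: performance scales linearly with fragment pipe count.
# Inputs are the clock-scaled figures from the text (R5xx results already at 550 MHz).

G70_PIPES = 24   # 7800GTX fragment pipes
R5XX_PIPES = 16  # X1800 / X1900 fragment "pipes"

def scale_to_24_pipes(fps_at_550mhz: float) -> float:
    """Scale a 16-pipe result to a hypothetical 24-pipe part at the same clock."""
    return fps_at_550mhz * G70_PIPES / R5XX_PIPES

g70 = 66.4                       # 7800GTX, fps
x1900 = scale_to_24_pipes(97.6)  # about 146.4 fps
x1800 = scale_to_24_pipes(40.9)  # about 61.35 fps

print(f"X1900 per pipe vs G70: {x1900 / g70:.2f}x")
print(f"G70 per pipe vs X1800: {g70 / x1800 - 1:+.1%}")
```

This prints 2.20x and +8.2%, i.e. the "2.2 times" and "about 8%" figures quoted above.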
Where does your figure of a 35% advantage per fragment pipe of the G70 come from?
The two PS3 tests (Steep Parallax Mapping and Fur) show the opposite, though:
http://www.digit-life.com/articles2/video/r580-part2.html
27% and 18% advantage per pipe in favour of R580, but they probably make use of dynamic branching as a performance tweak.
Performing the same analysis as above, I get the following:
Steep Parallax mapping
X1800 at same clock rate as G70 with same pipe count = 56 * 550 / 625 * 24 / 16 = 73.92 fps
Per pipe performance for X1800 compared to 7800GTX = 73.92/22 * 100 = 336%
X1900 at same clock rate as G70 with same pipe count = 66 * 550 / 650 * 24 / 16 = 83.77 fps
Per pipe performance for X1900 compared to 7800GTX = 83.77 / 22 * 100 = 381%
Procedural Fur
X1800 at same clock rate as G70 with same pipe count = 25 * 550 / 625 * 24 / 16 = 33 fps
Per pipe performance for X1800 compared to 7800GTX = 33 / 9 * 100 = 367%
X1900 apparently has the same performance on this test as X1800 (unusual, but possible if it's very branch-intensive)
Per pipe performance for X1900 compared to 7800GTX = 625 / 650 * 366.7 = 353%
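The same clock-and-pipe normalisation for the two PS3 tests, as a small Python sketch. Assumptions: linear scaling with core clock and pipe count, the 7800GTX taken as 24 pipes at 550 MHz, X1800 and X1900 as 16 pipes at 625 MHz and 650 MHz, and the X1900 Fur result taken as equal to the X1800's. The helper names are my own, and rounding may differ by a point from hand-worked figures.

```python
REF_CLOCK_MHZ = 550  # 7800GTX (G70) core clock used as the reference
REF_PIPES = 24       # G70 fragment pipes

def normalise(fps, clock_mhz, pipes=16):
    """Scale an R5xx result to 550 MHz and 24 pipes (linear-scaling assumption)."""
    return fps * REF_CLOCK_MHZ / clock_mhz * REF_PIPES / pipes

# test name: (X1800 fps @ 625 MHz, X1900 fps @ 650 MHz, 7800GTX fps)
ps3_tests = {
    "Steep Parallax Mapping": (56, 66, 22),
    "Procedural Fur": (25, 25, 9),  # X1900 taken as equal to X1800, as noted above
}

results = {}  # test name -> (X1800 per-pipe % of G70, X1900 per-pipe % of G70)
for name, (x1800, x1900, g70) in ps3_tests.items():
    results[name] = (
        100 * normalise(x1800, 625) / g70,
        100 * normalise(x1900, 650) / g70,
    )
    print(f"{name}: X1800 {results[name][0]:.0f}%, X1900 {results[name][1]:.0f}%")
```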
The PS2 tests on that page, Parallax Mapping and Frozen Glass, show a heavy dependency on _PP for G70. In FP32, though, the former shows a 35% advantage for G70 while the latter shows a 79% advantage.
Let's look at the "texture intensive" tests. Note that since these two tests are apparently texturing-intensive, the 7800GTX's higher memory bandwidth is probably also coming into play quite significantly in these figures, but I have not accounted for it in my analysis:
PS2 parallax mapping (partial precision)
X1800 at same clock rate as G70 with same pipe count = 291 * 550 / 625 * 24 / 16 = 384.1 fps
Per pipe performance for X1800 compared to 7800GTX = 384.1 / 462 * 100 = 83.1%
X1900 at same clock rate as G70 with same pipe count = 373 * 550/650 * 24/16 = 473.4 fps
Per pipe performance for X1900 compared to 7800GTX = 473.4 / 462 * 100 = 102.5%
Frozen Glass (partial precision)
X1800 at same clock rate as G70 with same pipe count = 632 * 550 / 625 * 24 / 16 = 834.2 fps
Per pipe performance for X1800 compared to 7800GTX = 834.2 / 766 * 100 = 109%
X1900 at same clock rate as G70 with same pipe count = 683 * 550/650 * 24/16 = 866.9 fps
Per pipe performance for X1900 compared to 7800GTX = 866.9 / 766 * 100 = 113%
Per pipe and per clock, G70 at partial precision wins one test against X1800 by about 20% and loses the other by 9%.
By the same metric it loses to X1900 by 2.5% in one test and 13% in the other.
PS2 parallax mapping (full precision)
Per pipe performance for X1800 compared to 7800GTX = 384.1 / 412 * 100 = 93.2%
Per pipe performance for X1900 compared to 7800GTX = 473.4 / 412 * 100 = 114.9%
Frozen Glass (full precision)
Per pipe performance for X1800 compared to 7800GTX = 834.2 / 713 * 100 = 117%
Per pipe performance for X1900 compared to 7800GTX = 866.9 / 713 * 100 = 122%
Per pipe and per clock, G70 wins one test by 7% over X1800 and loses the other by 17%.
By the same metric it loses to X1900 by 15% and 22% respectively.
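A Python sketch that reproduces the texture-intensive per-pipe percentages, under the same assumptions as the hand calculations: linear scaling with core clock and pipe count, 7800GTX at 550 MHz with 24 pipes, X1800/X1900 at 625/650 MHz with 16 pipes, and memory bandwidth ignored. The dictionary layout and helper name are only my own illustration; the printed values carry one more decimal than some of the hand-rounded figures.

```python
# Texture-intensive PS2 tests: R5xx per-pipe, per-clock results as a
# percentage of the 7800GTX, with G70 at partial (_PP) and full (FP32) precision.
# The R5xx figures are the same in both cases, since R5xx always runs FP32.

def normalise(fps, clock_mhz, pipes=16, ref_clock_mhz=550, ref_pipes=24):
    """Scale an R5xx result to the reference clock and pipe count."""
    return fps * ref_clock_mhz / clock_mhz * ref_pipes / pipes

# test name: (X1800 fps @ 625 MHz, X1900 fps @ 650 MHz, G70 _PP fps, G70 FP32 fps)
texture_tests = {
    "PS2 Parallax Mapping": (291, 373, 462, 412),
    "Frozen Glass": (632, 683, 766, 713),
}

per_pipe = {}  # (test, G70 precision) -> (X1800 %, X1900 %)
for name, (x1800, x1900, g70_pp, g70_fp32) in texture_tests.items():
    r520 = normalise(x1800, 625)
    r580 = normalise(x1900, 650)
    for precision, g70 in (("PP", g70_pp), ("FP32", g70_fp32)):
        per_pipe[(name, precision)] = (100 * r520 / g70, 100 * r580 / g70)

for (name, precision), (x1800_pct, x1900_pct) in per_pipe.items():
    print(f"{name} [G70 {precision}]: X1800 {x1800_pct:.1f}%, X1900 {x1900_pct:.1f}%")
```

A value below 100% means G70 is ahead per pipe and per clock; for example, 83.1% corresponds to G70 leading by 462 / 384.1 - 1, or about 20%.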
And now the "ALU intensive" versions
PS2 parallax mapping (partial precision)
X1800 at same clock rate as G70 with same pipe count = 256 * 550 / 625 * 24 / 16 = 337.9 fps
Per pipe performance for X1800 compared to 7800GTX = 337.9 / 470 * 100 = 71.9%
X1900 at same clock rate as G70 with same pipe count = 619 * 550/650 * 24/16 = 785.7 fps
Per pipe performance for X1900 compared to 7800GTX = 785.7 / 470 * 100 = 167.2%
Frozen Glass (partial precision)
X1800 at same clock rate as G70 with same pipe count = 663 * 550 / 625 * 24 / 16 = 875.2 fps
Per pipe performance for X1800 compared to 7800GTX = 875.2 / 877 * 100 = 99.8%
X1900 at same clock rate as G70 with same pipe count = 1035 * 550/650 * 24/16 = 1313.7 fps
Per pipe performance for X1900 compared to 7800GTX = 1313.7 / 877 * 100 = 149.8%
Per pipe and per clock, G70 at partial precision beats X1800 (which runs at full precision) by around 40% in one test and basically ties it in the other.
By the same metric it loses both tests to X1900, by 67% in one and 50% in the other.
PS2 parallax mapping (full precision)
Per pipe performance for X1800 compared to 7800GTX = 337.9 / 353 * 100 = 95.7%
Per pipe performance for X1900 compared to 7800GTX = 785.7 / 353 * 100 = 222.6%
Frozen Glass (full precision)
Per pipe performance for X1800 compared to 7800GTX = 875.2 / 773 * 100 = 113%
Per pipe performance for X1900 compared to 7800GTX = 1313.7 / 773 * 100 = 170%
At full precision, G70 trades wins with X1800 in these tests on per-pipe, per-clock performance.
By the same metric, G70 loses to X1900 by about 123% in one test and 70% in the other.
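And a similar Python sketch for the ALU-intensive tests, this time expressing the results as leads rather than as percentages of G70. Note the convention in the hypothetical lead_percent helper: it measures how far the first card is ahead of the second, so a negative G70-vs-X1800 value means G70 is behind. The assumptions are those of the hand calculations above (linear clock and pipe scaling, 550 MHz/24-pipe reference, memory bandwidth ignored).

```python
# ALU-intensive PS2 tests: leads between cards after clock/pipe normalisation.
# Assumptions: linear scaling with core clock and pipe count; reference is the
# 7800GTX at 550 MHz with 24 pipes; X1800 = 625 MHz, X1900 = 650 MHz, 16 pipes.

def normalise(fps, clock_mhz, pipes=16, ref_clock_mhz=550, ref_pipes=24):
    """Scale an R5xx result to the reference clock and pipe count."""
    return fps * ref_clock_mhz / clock_mhz * ref_pipes / pipes

def lead_percent(a, b):
    """How far a is ahead of b, in percent (negative if a is behind b)."""
    return 100 * (a / b - 1)

# test name: (X1800 fps, X1900 fps, G70 _PP fps, G70 FP32 fps)
alu_tests = {
    "PS2 Parallax Mapping": (256, 619, 470, 353),
    "Frozen Glass": (663, 1035, 877, 773),
}

leads = {}  # (test, G70 precision) -> (G70 lead over X1800, X1900 lead over G70)
for name, (x1800, x1900, g70_pp, g70_fp32) in alu_tests.items():
    r520 = normalise(x1800, 625)
    r580 = normalise(x1900, 650)
    for precision, g70 in (("PP", g70_pp), ("FP32", g70_fp32)):
        g70_vs_r520 = lead_percent(g70, r520)
        r580_vs_g70 = lead_percent(r580, g70)
        leads[(name, precision)] = (g70_vs_r520, r580_vs_g70)
        print(f"{name} [G70 {precision}]: G70 vs X1800 {g70_vs_r520:+.0f}%, "
              f"X1900 vs G70 {r580_vs_g70:+.0f}%")
```

Rounded, the FP32 rows show X1900 ahead of G70 by about 123% and 70%, and G70 ahead of X1800 in parallax mapping by about 4% but behind in Frozen Glass by about 12% (equivalently, X1800 is about 13% ahead).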
Uttar said: Personally, I would tend to believe that in games with a 3:1 ALU:TEX ratio, it is a reasonable estimate to say that one of NVIDIA's 24 PS pipelines is equivalent to one of ATI's 48 PS pipelines. This is because NVIDIA's pipelines can do VERY slightly more per clock, and you can roughly imagine the texturing operation every 3 clocks wasting that advantage.
Now, on the other hand, if you decrease the ALU:TEX ratio, NVIDIA's texturing abilities increase while their arithmetic ones decrease, which gives them an obvious advantage. So below that 1:3, you'd conceptualize each of NVIDIA's pipelines as doing more and more than ATI's "pipelines", up until the theoretical point of 1:0 and below, where it'd become a (24/16) performance ratio between NVIDIA and ATI (DX7-era games, and some DX8-era ones).
Now, what's more interesting is what happens when the ALU:TEX ratio goes beyond 3:1. Interestingly enough, NVIDIA's ALU1 gets asked less and less to do texture addressing, so their arithmetic power per pipeline increasingly surpasses ATI's. Obviously, they won't reach the equivalent of ATI's 48 pipelines, but perhaps 28-30 quite easily, which is why NVIDIA doesn't get beaten by 2-2.5x in purely arithmetic tests. So 3:1 is NVIDIA's weakness, but it gets less dramatic not only below that ratio, but also above it.
Jawed said: Generally I agree; the NVidia pipeline appears "more flexible", able to gracefully trade texturing and ALU proportions.
From these particular tests I don't see any indication that G70's per-pipe shader architecture scales better than X1900's when the texture instruction count is high, but I see plenty of indications that X1900's per-pipe shading advantage over G70 increases significantly as the shaders become ALU-intensive. I don't see how one could conclude from these results that a G70 pipeline scales significantly more 'gracefully' in either direction.
In these particular tests I see very little indication that a G70 pipeline running at equivalent (full) precision can outperform that of even an R520 by any significant margin, let alone an R580. There are evidently some cases where it can do quite well against R520 when it is allowed to run in partial precision against the R520 running at full precision.
The dynamic branching performance results speak for themselves.
[edit] Added analysis of some more of the quoted tests, and cleaned it up.[/edit]