What can be defined as an ALU exactly?

Discussion in 'Architecture and Products' started by Ailuros, Feb 17, 2006.

  1. andypski

    Regular

    Joined:
    May 20, 2002
    Messages:
    584
    Likes Received:
    28
    Location:
    Santa Clara
    Apologies for a long post with lots of numbers coming up ahead...

    I must admit to having no idea how you are deriving that result. Are you comparing A:B performance of a single ALU?

    Taking the results for the 1024x768 test case shown, and assuming that in such an ALU limited test the scaling is near-linear with engine clock rate I make the scaled X1900 performance at 550 MHz to be something like 97.6 fps compared to 66.4 for G70. Scaling by (24/16) for number of fragment pipes means that with 24 pipes the R580 would theoretically perform at 146.4 frames per second, or 2.2 times the performance per shader pipe when compared to G70.

    Comparing X1800 against 7800GTX, and scaling similarly by clock rate I make the scaled X1800 performance 40.9 frames per second at 550 MHz. Scaling by 24/16 for number of pipes would give 61.35 frames per second, so scaling to equal clock rates and pipe counts it would appear to me that the G70 fragment pipeline is performing about 8% better per clock on this test than an X1800. Now, given the fact that G70 supposedly has an entire additional MAD unit, and that this test is very heavy on the ALU instructions, that doesn't sound like a huge delta to me.

    In these numbers I am ignoring any potential performance gains for the 7800 from its higher memory clock - the effects in a heavily ALU limited test are probably small.

    Where does your figure of a 35% advantage per fragment pipe of the G70 come from?

    Performing the same analysis as above I get the following -

    Steep Parallax mapping
    X1800 at same clock rate as G70 with same pipe count = 56 * 550 / 625 * 24 / 16 = 73.92 fps
    Per pipe performance for X1800 compared to 7800GTX = 73.92/22 * 100 = 336%
    X1900 at same clock rate as G70 with same pipe count = 66 * 550 / 650 * 24 / 16 = 83.76 fps
    Per pipe performance for X1900 compared to 7800GTX = 83.76/22 * 100 = 380%

    Procedural Fur
    X1800 at same clock rate as G70 with same pipe count = 25 * 550 / 625 * 24 / 16 = 33 fps
    Per pipe performance for X1800 compared to 7800GTX = 33 / 9 * 100 = 366%
    X1900 apparently has the same performance on this test as X1800 (unusual, but possible if it's very branch-intensive)
    Per pipe performance for X1900 compared to 7800GTX = 625/650 * 366 = 352%
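    The normalisation used throughout these figures (scale the ATI result to G70's 550 MHz clock and 24 pipes, then divide by the 7800GTX score) can be sketched as a couple of small helpers. This is just a restatement of the arithmetic above in Python; the function names are mine:

```python
# Normalise a card's result to G70's clock and pipe count, assuming
# near-linear scaling with both in an ALU-limited test.
def scale_to_g70(fps, clock_mhz, pipes, g70_clock=550.0, g70_pipes=24):
    return fps * (g70_clock / clock_mhz) * (g70_pipes / pipes)

# Per-pipe, per-clock performance relative to the 7800GTX, in percent.
def per_pipe_vs_g70(fps, clock_mhz, pipes, g70_fps):
    return scale_to_g70(fps, clock_mhz, pipes) / g70_fps * 100.0

# Steep Parallax Mapping: X1800 (625 MHz, 16 pipes) vs 7800GTX at 22 fps
print(round(scale_to_g70(56, 625, 16), 2))      # 73.92 fps
print(round(per_pipe_vs_g70(56, 625, 16, 22)))  # 336 (%)
```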

    Let's look at the "texture intensive" tests. Note that since these two tests are apparently texturing-intensive, the 7800GTX's higher memory bandwidth is probably also coming into play quite significantly in these performance figures, but I have not accounted for this in my analysis:

    PS2 parallax mapping (partial precision)
    X1800 at same clock rate as G70 with same pipe count = 291 * 550 / 625 * 24 / 16 = 384.1 fps
    Per pipe performance for X1800 compared to 7800GTX = 384.1 / 462 * 100 = 83.1%
    X1900 at same clock rate as G70 with same pipe count = 373 * 550/650 * 24/16 = 473.4 fps
    Per pipe performance for X1900 compared to 7800GTX = 473.4 / 462 * 100 = 102.5%

    Frozen Glass (partial precision)
    X1800 at same clock rate as G70 with same pipe count = 632 * 550 / 625 * 24 / 16 = 834.2 fps
    Per pipe performance for X1800 compared to 7800GTX = 834.2/766* 100 = 109%
    X1900 at same clock rate as G70 with same pipe count = 683 * 550/650 * 24/16 = 866.9 fps
    Per pipe performance for X1900 compared to 7800GTX = 866.9 / 766 * 100 = 113%

    G70 wins one test at partial precision by about 20% and loses the other by 9% against an X1800 per-clock per-pipe
    By the same metric it loses by 2.5% in one test and 13% in the other against X1900

    PS2 parallax mapping (full precision)
    Per pipe performance for X1800 compared to 7800GTX = 384.1 / 412 * 100 = 93.2%
    Per pipe performance for X1900 compared to 7800GTX = 473.4 / 412 * 100 = 114.9%

    Frozen Glass (full precision)
    Per pipe performance for X1800 compared to 7800GTX = 834.2/713* 100 = 117%
    Per pipe performance for X1900 compared to 7800GTX = 866.9 / 713 * 100 = 121%

    G70 wins one test by 7% over X1800 and loses the other by 17% per-pipe per clock
    By the same metric it loses to X1900 by 15% and 21% respectively.

    And now the "ALU intensive" versions

    PS2 parallax mapping (partial precision)
    X1800 at same clock rate as G70 with same pipe count = 256 * 550 / 625 * 24 / 16 = 338 fps
    Per pipe performance for X1800 compared to 7800GTX = 338.1 / 470* 100 = 71.9%
    X1900 at same clock rate as G70 with same pipe count = 619 * 550/650 * 24/16 = 785.7 fps
    Per pipe performance for X1900 compared to 7800GTX = 785.7 / 470 * 100 = 167.2%

    Frozen Glass (partial precision)
    X1800 at same clock rate as G70 with same pipe count = 663 * 550 / 625 * 24 / 16 = 875.2 fps
    Per pipe performance for X1800 compared to 7800GTX = 875.2/877* 100 = 99.8%
    X1900 at same clock rate as G70 with same pipe count = 1035 * 550/650 * 24/16 = 1313.7 fps
    Per pipe performance for X1900 compared to 7800GTX = 1313.7 / 877 * 100 = 149.8%

    At partial precision per-pipe per-clock G70 wins one test against X1800 (which runs at full precision) by around 40%, and basically ties the other case.
    By the same metric it loses both tests against X1900 by 67% in one test and 50% in the other.

    PS2 parallax mapping (full precision)
    Per pipe performance for X1800 compared to 7800GTX = 338.1 / 353 * 100 = 95.8%
    Per pipe performance for X1900 compared to 7800GTX = 785.7 / 353 * 100 = 222.6%

    Frozen Glass (full precision)
    Per pipe performance for X1800 compared to 7800GTX = 875.2/773* 100 = 113%
    Per pipe performance for X1900 compared to 7800GTX = 1313.7 / 773 * 100 = 170%

    At full precision G70 trades wins in these tests with X1800 in per-pipe per-clock performance.
    G70 loses to an X1900 by 120% in one test and 70% in the other test by the same metric.
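    As a quick sanity check, the "loses by N%" deltas are just (scaled rival / G70 fps - 1) * 100, using the scaled X1900 figures computed above (the first comes out nearer 123% before rounding):

```python
# "G70 behind by N%" for the full-precision ALU-intensive tests,
# from the scaled X1900 figures worked out above.
tests = {
    "PS2 parallax mapping": (785.7, 353),   # (scaled X1900 fps, 7800GTX fps)
    "Frozen Glass":         (1313.7, 773),
}
for name, (rival, g70) in tests.items():
    print(f"{name}: G70 behind by {(rival / g70 - 1) * 100:.0f}%")
# PS2 parallax mapping: G70 behind by 123%
# Frozen Glass: G70 behind by 70%
```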

    From these particular tests I don't see any indication that G70's per-pipe shader architecture scales better than X1900's when the texture instruction count is high, but I see plenty of indications that X1900's shading performance advantage over G70 increases significantly per pipe as the shaders become ALU-intensive. I don't see how any conclusion that a G70 pipeline is significantly more 'graceful' in its scaling in either direction can be derived.

    In these particular tests I see very little indication that a G70 pipeline running at equivalent (full) precision can outperform that of even an R520 by any significant margin, let alone an R580. There are evidently some cases where it can do quite well against R520 when it is allowed to run in partial precision against the R520 running at full precision.

    The dynamic branching performance results speak for themselves.

    [edit] Added analysis of some more of the quoted tests, and cleaned it up.[/edit]
     
    #41 andypski, Feb 19, 2006
    Last edited by a moderator: Feb 19, 2006
    Geo likes this.
  2. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    Jawed is obviously comparing 24 "shader pipelines" to 48 "shader pipelines".
     
  3. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    That's where you've gone horribly wrong. How in the world can you quote the number 16 for R580 when talking about shader performance? Even ATi quotes 48 shader pipelines for R580.
     
    Jawed likes this.
  4. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    That second diagram is of the NV3x shader pipe. The first one is so abstract that it applies equally well to NV20.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Yep "48 fragment pipelines" in the case of R580, which seems reasonable if we're talking about arithmetic-intensive shaders, where texturing, hopefully, is never on the critical path.

    ---

    Yes, those PS3 shaders (Steep Parallax Mapping and Fur) do have dynamic branching (gah, didn't read that earlier :oops: ), and they seem to be bound more by DB performance and side-effects of 1:1 or 3:1 (TMU utilisation?) than the pure fragment-rate. So for the purposes of evaluating "ALU-effectiveness", as it were, those two PS3 shaders aren't much good.

    ---

    The SM2 shader results you worked from, Andy (Parallax Mapping and Frozen Glass), are the texturing-intensive results. The arithmetic-intensive results are a little different, e.g.:

    Full Precision:
    PM: X1800XT = 95% of 7800GTX-512
    FG: X1800XT = 113% of 7800GTX-512

    Curiously R520 prefers the texture-intensive version of the PM synthetic (as does 7800GTX-512 in full precision).

    Jawed

    EDIT: ah, seems you've tweaked your posting, Andy...
    EDIT2: clocks on X1800XT, sigh
     
    #45 Jawed, Feb 19, 2006
    Last edited by a moderator: Feb 19, 2006
  6. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    NV3X/NV4X/G70 all share the same base design. The first diagram is part of this design too, as NV20 looks different.
     
  7. andypski

    Regular

    Joined:
    May 20, 2002
    Messages:
    584
    Likes Received:
    28
    Location:
    Santa Clara
    Very easily.

    I thought people were largely trying to compare how efficiently the two architectures deal with shaders with varying numbers of texture and ALU instructions, so I used the numbers from R520 primarily to give a baseline for comparison. As such, if you want to look at an R520 'shader pipeline' compared to a G70 'shader pipeline' the easiest way to do it is to view each R520 pipeline as having one dedicated texture unit and one dedicated ALU, and each G70 pipeline as having one shared ALU/texture unit and one dedicated ALU. R520 has 16 of these pipelines, and G70 has 24.

    Given the above setup you might imagine that each G70 pipeline would perform similarly to an R520 pipeline if you had a shader with an even ratio of texture to ALU (or a ratio where there is more texture than ALU), but as the ratio starts to favor ALU you might then expect it to behave more as a dual-ALU pipeline (and as such, if all ALUs are equal, you would look for it to scale to 2x the performance of R520 per-pipeline per-clock in these cases).

    When comparing R580 and G70 I'm sure you could look at it in many ways, after all they are two different architectures, however in terms of capabilities and performance the easiest way to frame the comparison (in the same terms as used above) is to consider the R580 as having 16 fragment pipelines, each with one texture unit and 3 dedicated ALUs. Again, when the comparison is stated in these terms R580 has 16 such 'pipelines' and G70 has 24. While this may not exactly reflect the realities of the architecture, it is an easy and fairly accurate way to approach the problem and avoids the idea of having a fractional number of texture units per ALU.
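    That counting convention can be written down as a tiny model (purely a sketch of the framing above, not a claim about the real datapaths; G70's shared ALU/texture unit is counted as a full ALU here):

```python
from dataclasses import dataclass

@dataclass
class Gpu:
    name: str
    pipes: int          # fragment pipelines, in the framing above
    alus_per_pipe: int  # ALUs per pipeline (G70's shared unit counted as one)

    @property
    def total_alus(self) -> int:
        return self.pipes * self.alus_per_pipe

r520 = Gpu("R520", pipes=16, alus_per_pipe=1)
g70  = Gpu("G70",  pipes=24, alus_per_pipe=2)
r580 = Gpu("R580", pipes=16, alus_per_pipe=3)

print(r520.total_alus, g70.total_alus, r580.total_alus)  # 16 48 48
```

    On this count G70 and R580 both land at 48 'ALUs' in total, which is why a per-'ALU' comparison between them needs only a clock-rate scaling and no pipe scaling.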

    Anyway, if you tell me the terms of how you would prefer to do the comparison, or how you think it should be expressed we can debate it in those terms (assuming that they are reasonable), but I believe that the way I've shown it above forms a reasonable basis for comparison. If you think otherwise then I might be so bold as to suggest that you might be getting it "horribly wrong".
     
    #47 andypski, Feb 19, 2006
    Last edited by a moderator: Feb 19, 2006
  8. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Well, seeing how R580 is 2.4x the per-clock speed of R520 in the 3DMark06 shader, texturing speed is obviously somewhat of an issue. Saying G70 is 35% faster is misleading.

    Why would G70 be beaten by a factor of 2.5 (without branching)? The maximum per-clock advantage you'd naively calculate is 48/24 = 2. The Cook-Torrance test is as purely arithmetic a test as we have data for (with R580 at 2.8x R520), and it shows R580 at 1.7x the speed of G70. If you're talking about final numbers, you have 329/162 = 2.03x.


    In the end, I think saying G70's shader pipe is much superior to ATI's shader pipe is wrong. It's not 2x, not 1.5x, but maybe 20% faster on average in arithmetic ability. Saying 48 vs. 24 for math (which slightly overstates ATI's advantage) while keeping in mind 16 vs. 24 for texturing (which slightly overstates NVidia's advantage) is a very good way of describing the situation.

    There's no need to make it more complicated than that.
     
  9. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    215
    Location:
    Uffda-land
    16 eXXXtreme pipes!

    I miss my old pipes. They were fine pipes. I knew what they were. Others knew what I meant when I pointed at them too. Ah well.
     
  10. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Okay, I didn't write that very well. There are two points I wanted to make. First, having an additional MADD per clock doesn't get you very much. Second, the G70 pipeline isn't much faster than a R520 pipeline most of the time.

    Indeed, I don't know how G70 would perform without the second ALU, so I was wrong in the way I wrote that statement.
     
  11. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    OK, I have no problem with any of the points you mentioned here. Certainly the comment that you originally reacted to was rather silly, and NVidia made a very good architecture for NV40/G70.

    My response was initially a reaction to this:
    Counting MADDs doesn't make more sense than counting "half ALUs", and I've given data to back that up. If you want to say G70 has 48 ALUs, then it makes more sense to say R580 has 96 ALUs rather than 48, at least when comparing math performance. Tim most certainly was not suggesting 96 versus 24.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    That seems like a good starting point, since the benchmark is scaling almost perfectly with ALU capacity on the same (R5xx) architecture. (Sigh, I skipped too far down the page :oops: )

    So comparing across architectures at FP32, per fragment, per clock:
    • X1900XTX is 86% compared to GTX-512 (or GTX-512 is 17% faster)
    • X1800XT is 91% compared to GTX-512 (or GTX-512 is 10% faster)
    GTX-512 gets 39% faster with _PP, indicating that G70's pipeline is suffering a pretty severe loss in ALU utilisation at full precision. But even with that loss, per fragment and per clock the GTX-512 is holding its own.

    Jawed
     
  13. andypski

    Regular

    Joined:
    May 20, 2002
    Messages:
    584
    Likes Received:
    28
    Location:
    Santa Clara
    Ok - I'm now totally confused as to how you are trying to view G70 performance per fragment pipe in ALU-limited cases - are you viewing it as 24 pipelines with one ALU per pipeline (total 24 ALUs) or 24 pipelines with two ALUs (48 ALUs total)? If we claim that G70 basically has two 'full' instances of an ALU per pipeline then it would seem that you would expect it to behave like a part with 48 total 'ALUs' (whatever those are).

    If you want to frame R580 in the same terms - i.e. that it has 48 fragment pipelines each with one 'ALU' for this comparison - then we can do that -

    Let's look at the Cook-Torrance test with partial precision first -

    R580 performance = 332.4 * 550/650 = 281.3 fps
    Per "ALU" performance for R580 versus G70 = 281.3/226.1 * 100 = 124%

    So X1900's performance (per ALU) seems to be about 24% faster than G70

    Now with full precision:

    R580 versus G70 = 281.3/162.3 * 100 = 173.3%

    So X1900's performance (per ALU) seems to be about 73% faster than G70.
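    Since both parts are counted as 48 'ALUs' in this framing, the only normalisation left is the clock ratio. A sketch reproducing the two figures above (the function name is mine):

```python
# Scale R580's Cook-Torrance result to G70's 550 MHz clock (no pipe/ALU
# scaling needed when both are counted as 48 'ALUs'), then compare.
def r580_per_alu_vs_g70(r580_fps, g70_fps, r580_clock=650.0, g70_clock=550.0):
    scaled = r580_fps * g70_clock / r580_clock
    return scaled, scaled / g70_fps * 100.0

scaled, pct = r580_per_alu_vs_g70(332.4, 226.1)  # partial precision
print(f"{scaled:.1f} fps -> {pct:.0f}%")         # 281.3 fps -> 124%
_, pct_fp = r580_per_alu_vs_g70(332.4, 162.3)    # full precision
print(f"{pct_fp:.1f}%")                          # 173.3%
```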

    I need to check through this later - I'm doing this in a hurry so I can't guarantee I haven't made some mistakes.
     
  14. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    My apologies - the 2-2.5x factor was comparing final numbers, not per-clock ones. I indeed should have made this clearer, considering I was talking of architectures in the last few lines, then suddenly jumped to comparing final chips.

    I'm interested in getting a better understanding of the performance though, rather than just saying "overall, it's the same fucking thing", even though I roughly agree with that sentiment. Perhaps I'm trying to look a bit too much into the details, but last I heard, this is what B3D was all about - not stopping at the whole "zomg it got 48 pipelines!" thing. Ah well.

    If anything, however, the 1.7x per-clock above for G70 vs R580 is pretty much exactly what I explained above, as this is the "purely arithmetic" case where ATI's texture units are idle and NVIDIA's ALUs aren't used for addressing. My point above was that if you had a ratio where ATI's unit usage was maximal, ATI will most likely win by a larger factor than in the "purely arithmetic" case, no matter how unintuitive it might seem - and cases nearer that situation than the "tex-bound" or "full-alu" situations are perfectly possible in shader-bound games.


    Uttar
     
  15. overclocked

    Veteran

    Joined:
    Oct 25, 2002
    Messages:
    1,317
    Likes Received:
    6
    Location:
    Sweden
    I guess the free FP16 normalize really comes in very handy on G7x.
    Btw, is it only up to the coder to give the _pp hints, or is the compiler smart enough to decide on its own?
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I'm trying to view per-fragment arithmetic rate, with the ALU structure treated as a black box.

    R5xx and G70 have such differing ALU structures within their pipelines that it's not possible to say that an FP32 MAD in both executes in the same time, for example. An "ATI-FLOP" in one pipeline isn't equal to an "NV-FLOP" in the other. (You could argue that 1 ATI-FLOP is about 1.21 NV-FLOP if you wanted to appease Jaws.)

    Since NV40 and R420 appeared, we've known that "per fragment, per clock" the significantly more complex ALU architecture of the NVidia "superscalar" design gives it an advantage, particularly with relatively short shaders or with _PP. Even if it also runs at lower overall utilisation than the competing ATI architecture, the "peaks" of issuable-complexity in shaders allow it to claw back that lost utilisation. I dare say those peaks are quite frequent in most of today's games, with games like Far Cry and FEAR apparently being exceptions.

    In other words, it's rare that an NV40 or G70 pipeline acts like merely a single MAD-capable (3+1) architecture. The second ALU is genuinely making a reasonably significant difference (muddied, somewhat, by the flexibility of each ALU, e.g. 2+2). Obviously NVidia has a get-out-of-jail-free card, with so much code producing acceptable results in _PP, so the register bandwidth limitation doesn't hurt with current games.

    The Cook-Torrance test genuinely seems to be the most arithmetic-intensive synthetic around, so the results of R520 versus G70 (R520 performing at 91% of G70) seems like a fair reflection, particularly as G70 is losing so much performance in full precision.

    In R580, instruction decode, register fetch/store and the render back-end are all ganged together (as "quads"). In that sense it's definitely a four-quad architecture, like R520 (but with 12 fragments per "quad-pipeline" instead of 4).

    If R580 is a 16-pipeline GPU (four "quads" of 12), then that makes NV40 a 4 pipeline GPU (one "quad" of 16), as all 16 fragments being shaded in NV40 have identical shader state (even if they're on different triangles). By the same argument, Xenos is a 3 pipeline GPU (three arrays of 16). In strict architectural terms, these descriptions hold sway and I won't argue with them.

    But they entirely obfuscate the matter under discussion, per-fragment arithmetic rate.

    Clearly R580 is a monster. I'm not arguing it isn't.

    Jawed
     
  17. andypski

    Regular

    Joined:
    May 20, 2002
    Messages:
    584
    Likes Received:
    28
    Location:
    Santa Clara
    Ok - now I understand what you're doing, although I'm still not entirely sure how useful comparing things in such a way actually is.

    Thanks
    - Andy.
     
  18. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Even if it's not a rather useful comparison, it's still a quite interesting one.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Yeah, any single parameter, analysed to death, seems only capable of misleading.

    You guys with your GPU simulators are the lucky ones :!: I'd love to know how a 2:1 R580 (instead of 3:1) would have performed in games - I suspect it would have been practically identical to a 3:1 R580.

    I dare say we'll be waiting a long time before any games really stretch the 3:1 ratio.

    Jawed
     
  20. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    As was mentioned before, the R580 shader core is still a 16-pipe arrangement, just filled with more ALUs per quad in-line, so do you think this would be a correct representation of the case?
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.