What can be defined as an ALU exactly?

Discussion in 'Architecture and Products' started by Ailuros, Feb 17, 2006.

  1. andypski

    Regular

    Joined:
    May 20, 2002
    Messages:
    584
    Likes Received:
    28
    Location:
    Santa Clara
    Apologies for a long post with lots of numbers coming up ahead...

    I must admit to having no idea how you are deriving that result. Are you comparing A:B performance of a single ALU?

    Taking the results for the 1024x768 test case shown, and assuming that in such an ALU limited test the scaling is near-linear with engine clock rate I make the scaled X1900 performance at 550 MHz to be something like 97.6 fps compared to 66.4 for G70. Scaling by (24/16) for number of fragment pipes means that with 24 pipes the R580 would theoretically perform at 146.4 frames per second, or 2.2 times the performance per shader pipe when compared to G70.

    Comparing X1800 against 7800GTX, and scaling similarly by clock rate I make the scaled X1800 performance 40.9 frames per second at 550 MHz. Scaling by 24/16 for number of pipes would give 61.35 frames per second, so scaling to equal clock rates and pipe counts it would appear to me that the G70 fragment pipeline is performing about 8% better per clock on this test than an X1800. Now, given the fact that G70 supposedly has an entire additional MAD unit, and that this test is very heavy on the ALU instructions, that doesn't sound like a huge delta to me.

    In these numbers I am ignoring any potential performance gains for the 7800 from its higher memory clock - the effects in a heavily ALU limited test are probably small.

    Where does your figure of a 35% advantage per fragment pipe of the G70 come from?

    Performing the same analysis as above I get the following -

    Steep Parallax mapping
    X1800 at same clock rate as G70 with same pipe count = 56 * 550 / 625 * 24 / 16 = 73.92 fps
    Per pipe performance for X1800 compared to 7800GTX = 73.92/22 * 100 = 336%
    X1900 at same clock rate as G70 with same pipe count = 66 * 550 / 650 * 24 / 16 = 83.76 fps
    Per pipe performance for X1900 compared to 7800GTX = 83.76/22 * 100 = 380%

    Procedural Fur
    X1800 at same clock rate as G70 with same pipe count = 25 * 550 / 625 * 24 / 16 = 33 fps
    Per pipe performance for X1800 compared to 7800GTX = 33 / 9 * 100 = 366%
    X1900 apparently has the same performance on this test as X1800 (unusual, but possible if it's very branch-intensive)
    Per pipe performance for X1900 compared to 7800GTX = 625/650 * 366 = 352%
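    The normalisation used throughout these figures (scale the ATI result to G70's 550 MHz clock and 24 pipes, then divide by the 7800GTX score) can be sketched as a couple of small helpers. This is just a restatement of the arithmetic above in Python; the function names are mine:

```python
# Normalise a card's result to G70's clock and pipe count, assuming
# near-linear scaling with both in an ALU-limited test.
def scale_to_g70(fps, clock_mhz, pipes, g70_clock=550.0, g70_pipes=24):
    return fps * (g70_clock / clock_mhz) * (g70_pipes / pipes)

# Per-pipe, per-clock performance relative to the 7800GTX, in percent.
def per_pipe_vs_g70(fps, clock_mhz, pipes, g70_fps):
    return scale_to_g70(fps, clock_mhz, pipes) / g70_fps * 100.0

# Steep Parallax Mapping: X1800 (625 MHz, 16 pipes) vs 7800GTX at 22 fps
print(round(scale_to_g70(56, 625, 16), 2))      # 73.92 fps
print(round(per_pipe_vs_g70(56, 625, 16, 22)))  # 336 (%)
```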

    Let's look at the "texture intensive" tests. Note that since these two tests are apparently texturing-intensive, the 7800GTX's higher memory bandwidth is probably also coming into play quite significantly in these performance figures, but I have not accounted for this in my analysis:

    PS2 parallax mapping (partial precision)
    X1800 at same clock rate as G70 with same pipe count = 291 * 550 / 625 * 24 / 16 = 384.1 fps
    Per pipe performance for X1800 compared to 7800GTX = 384.1 / 462 * 100 = 83.1%
    X1900 at same clock rate as G70 with same pipe count = 373 * 550/650 * 24/16 = 473.4 fps
    Per pipe performance for X1900 compared to 7800GTX = 473.4 / 462 * 100 = 102.5%

    Frozen Glass (partial precision)
    X1800 at same clock rate as G70 with same pipe count = 632 * 550 / 625 * 24 / 16 = 834.2 fps
    Per pipe performance for X1800 compared to 7800GTX = 834.2/766* 100 = 109%
    X1900 at same clock rate as G70 with same pipe count = 683 * 550/650 * 24/16 = 866.9 fps
    Per pipe performance for X1900 compared to 7800GTX = 866.9 / 766 * 100 = 113%

    G70 wins one test at partial precision by about 20% and loses the other by 9% against an X1800 per-clock per-pipe
    By the same metric it loses by 2.5% in one test and 13% in the other against X1900

    PS2 parallax mapping (full precision)
    Per pipe performance for X1800 compared to 7800GTX = 384.1 / 412 * 100 = 93.2%
    Per pipe performance for X1900 compared to 7800GTX = 473.4 / 412 * 100 = 114.9%

    Frozen Glass (full precision)
    Per pipe performance for X1800 compared to 7800GTX = 834.2/713* 100 = 117%
    Per pipe performance for X1900 compared to 7800GTX = 866.9 / 713 * 100 = 121%

    G70 wins one test by 7% over X1800 and loses the other by 17% per-pipe per clock
    By the same metric it loses to X1900 by 15% and 21% respectively.

    And now the "ALU intensive" versions

    PS2 parallax mapping (partial precision)
    X1800 at same clock rate as G70 with same pipe count = 256 * 550 / 625 * 24 / 16 = 338 fps
    Per pipe performance for X1800 compared to 7800GTX = 338.1 / 470* 100 = 71.9%
    X1900 at same clock rate as G70 with same pipe count = 619 * 550/650 * 24/16 = 785.7 fps
    Per pipe performance for X1900 compared to 7800GTX = 785.7 / 470 * 100 = 167.2%

    Frozen Glass (partial precision)
    X1800 at same clock rate as G70 with same pipe count = 663 * 550 / 625 * 24 / 16 = 875.2 fps
    Per pipe performance for X1800 compared to 7800GTX = 875.2/877* 100 = 99.8%
    X1900 at same clock rate as G70 with same pipe count = 1035 * 550/650 * 24/16 = 1313.7 fps
    Per pipe performance for X1900 compared to 7800GTX = 1313.7 / 877 * 100 = 149.8%

    At partial precision per-pipe per-clock G70 wins one test against X1800 (which runs at full precision) by around 40%, and basically ties the other case.
    By the same metric it loses both tests against X1900 by 67% in one test and 50% in the other.

    PS2 parallax mapping (full precision)
    Per pipe performance for X1800 compared to 7800GTX = 338.1 / 353 * 100 = 95.8%
    Per pipe performance for X1900 compared to 7800GTX = 785.7 / 353 * 100 = 222.6%

    Frozen Glass (full precision)
    Per pipe performance for X1800 compared to 7800GTX = 875.2/773* 100 = 113%
    Per pipe performance for X1900 compared to 7800GTX = 1313.7 / 773 * 100 = 170%

    At full precision G70 trades wins in these tests with X1800 in per-pipe per-clock performance.
    G70 loses to an X1900 by 120% in one test and 70% in the other test by the same metric.
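    As a quick sanity check, the "loses by N%" deltas are just (scaled rival / G70 fps - 1) * 100, using the scaled X1900 figures computed above (the first comes out nearer 123% before rounding):

```python
# "G70 behind by N%" for the full-precision ALU-intensive tests,
# from the scaled X1900 figures worked out above.
tests = {
    "PS2 parallax mapping": (785.7, 353),   # (scaled X1900 fps, 7800GTX fps)
    "Frozen Glass":         (1313.7, 773),
}
for name, (rival, g70) in tests.items():
    print(f"{name}: G70 behind by {(rival / g70 - 1) * 100:.0f}%")
# PS2 parallax mapping: G70 behind by 123%
# Frozen Glass: G70 behind by 70%
```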

    From these particular tests I don't see any indication that G70's per-pipe shader architecture scales better than X1900's when the texture instruction count is high, but I see plenty of indications that X1900's shading performance advantage over G70 increases significantly per pipe as the shaders become ALU-intensive. I don't see how any conclusion that a G70 pipeline is significantly more 'graceful' in its scaling in either direction can be derived.

    In these particular tests I see very little indication that a G70 pipeline running at equivalent (full) precision can outperform that of even an R520 by any significant margin, let alone an R580. There are evidently some cases where it can do quite well against R520 when it is allowed to run in partial precision against the R520 running at full precision.

    The dynamic branching performance results speak for themselves.

    [edit] Added analysis of some more of the quoted tests, and cleaned it up.[/edit]
     
    #41 andypski, Feb 19, 2006
    Last edited by a moderator: Feb 19, 2006
    Geo likes this.
  2. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    Jawed is obviously comparing 24 "shader pipelines" to 48 "shader pipelines".
     
  3. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    That's where you've gone horribly wrong. How in the world can you quote the number 16 for R580 when talking about shader performance? Even ATi quotes 48 shader pipelines for R580.
     
    Jawed likes this.
  4. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    That second diagram is of the NV3x shader pipe. The first one is so abstract that it applies equally well to NV20.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Yep "48 fragment pipelines" in the case of R580, which seems reasonable if we're talking about arithmetic-intensive shaders, where texturing, hopefully, is never on the critical path.

    ---

    Yes, those PS3 shaders (Steep Parallax Mapping and Fur) do have dynamic branching (gah, didn't read that earlier :oops: ), and they seem to be bound more by DB performance and side-effects of 1:1 or 3:1 (TMU utilisation?) than the pure fragment-rate. So for the purposes of evaluating "ALU-effectiveness", as it were, those two PS3 shaders aren't much good.

    ---

    The SM2 shader results you worked from, Andy (Parallax Mapping and Frozen Glass), are the texturing-intensive results. The arithmetic-intensive results are a little different, e.g.:

    Full Precision:
    PM: X1800XT = 95% of 7800GTX-512
    FG: X1800XT = 113% of 7800GTX-512

    Curiously R520 prefers the texture-intensive version of the PM synthetic (as does 7800GTX-512 in full precision).

    Jawed

    EDIT: ah, seems you've tweaked your posting, Andy...
    EDIT2: clocks on X1800XT, sigh
     
    #45 Jawed, Feb 19, 2006
    Last edited by a moderator: Feb 19, 2006
  6. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    NV3X/NV4X/G70 all share the same base design. The first diagram is part of this design too, as NV20 looks different.
     
  7. andypski

    Regular

    Joined:
    May 20, 2002
    Messages:
    584
    Likes Received:
    28
    Location:
    Santa Clara
    Very easily.

    I thought people were largely trying to compare how efficiently the two architectures deal with shaders with varying numbers of texture and ALU instructions, so I used the numbers from R520 primarily to give a baseline for comparison. As such, if you want to look at an R520 'shader pipeline' compared to a G70 'shader pipeline' the easiest way to do it is to view each R520 pipeline as having one dedicated texture unit and one dedicated ALU, and each G70 pipeline as having one shared ALU/texture unit and one dedicated ALU. R520 has 16 of these pipelines, and G70 has 24.

    Given the above setup you might imagine that each G70 pipeline would perform similarly to an R520 pipeline if you had a shader with an even ratio of texture to ALU (or a ratio where there is more texture than ALU), but as the ratio starts to favor ALU you might then expect it to behave more as a dual-ALU pipeline (and as such, if all ALUs are equal, you would look for it to scale to 2x the performance of R520 per-pipeline per-clock in these cases).

    When comparing R580 and G70 I'm sure you could look at it in many ways, after all they are two different architectures, however in terms of capabilities and performance the easiest way to frame the comparison (in the same terms as used above) is to consider the R580 as having 16 fragment pipelines, each with one texture unit and 3 dedicated ALUs. Again, when the comparison is stated in these terms R580 has 16 such 'pipelines' and G70 has 24. While this may not exactly reflect the realities of the architecture, it is an easy and fairly accurate way to approach the problem and avoids the idea of having a fractional number of texture units per ALU.
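    That counting convention can be written down as a tiny model (purely a sketch of the framing above, not a claim about the real datapaths; G70's shared ALU/texture unit is counted as a full ALU here):

```python
from dataclasses import dataclass

@dataclass
class Gpu:
    name: str
    pipes: int          # fragment pipelines, in the framing above
    alus_per_pipe: int  # ALUs per pipeline (G70's shared unit counted as one)

    @property
    def total_alus(self) -> int:
        return self.pipes * self.alus_per_pipe

r520 = Gpu("R520", pipes=16, alus_per_pipe=1)
g70  = Gpu("G70",  pipes=24, alus_per_pipe=2)
r580 = Gpu("R580", pipes=16, alus_per_pipe=3)

print(r520.total_alus, g70.total_alus, r580.total_alus)  # 16 48 48
```

    On this count G70 and R580 both land at 48 'ALUs' in total, which is why a per-'ALU' comparison between them needs only a clock-rate scaling and no pipe scaling.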

    Anyway, if you tell me the terms of how you would prefer to do the comparison, or how you think it should be expressed we can debate it in those terms (assuming that they are reasonable), but I believe that the way I've shown it above forms a reasonable basis for comparison. If you think otherwise then I might be so bold as to suggest that you might be getting it "horribly wrong".
     
    #47 andypski, Feb 19, 2006
    Last edited by a moderator: Feb 19, 2006
  8. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Well, seeing how R580 is 2.4x the per-clock speed of R520 in the 3DMark06 shader, texturing speed is obviously somewhat of an issue. Saying G70 is 35% faster is misleading.

    Why would G70 be beaten by a factor of 2.5 (without branching)? The maximum per-clock advantage you'd naively calculate is 48/24 = 2. The Cook-Torrance test is as purely arithmetic a test as we have data for (with R580 at 2.8x R520), and it shows R580 at 1.7x the speed of G70. If you're talking about final numbers, you have 329/162 = 2.03x.


    In the end, I think saying G70's shader pipe is much superior to ATI's shader pipe is wrong. It's not 2x, not 1.5x, but maybe 20% faster on average in arithmetic ability. Saying 48 vs. 24 for math (which slightly overstates ATI's advantage) while keeping in mind 16 vs. 24 for texturing (which slightly overstates NVidia's advantage) is a very good way of describing the situation.

    There's no need to make it more complicated than that.
     
  9. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    215
    Location:
    Uffda-land
    16 eXXXtreme pipes!

    I miss my old pipes. They were fine pipes. I knew what they were. Others knew what I meant when I pointed at them too. Ah well.
     
  10. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Okay, I didn't write that very well. There are two points I wanted to make. First, having an additional MADD per clock doesn't get you very much. Second, the G70 pipeline isn't much faster than a R520 pipeline most of the time.

    Indeed, I don't know how G70 would perform without the second ALU, so I was wrong in the way I wrote that statement.
     
  11. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    OK, I have no problem with any of the points you mentioned here. Certainly the comment that you originally reacted to was rather silly, and NVidia made a very good architecture for NV40/G70.

    My response was initially a reaction to this:
    Counting MADDs doesn't make more sense than counting "half ALUs", and I've given data to back that up. If you want to say G70 has 48 ALUs, then it makes more sense to say R580 has 96 ALUs rather than 48, at least when comparing math performance. Tim most certainly was not suggesting 96 versus 24.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    That seems like a good starting point, since the benchmark is scaling almost perfectly with ALU capacity on the same (R5xx) architecture. (Sigh, I skipped too far down the page :oops: )

    So comparing across architectures at FP32, per fragment, per clock:
    • X1900XTX is 86% compared to GTX-512 (or GTX-512 is 17% faster)
    • X1800XT is 91% compared to GTX-512 (or GTX-512 is 10% faster)
    GTX-512 gets 39% faster with _PP, indicating that G70's pipeline is suffering a pretty severe loss in ALU utilisation at full precision. But even with that loss, per fragment and per clock the GTX-512 is holding its own.

    Jawed
     
  13. andypski

    Regular

    Joined:
    May 20, 2002
    Messages:
    584
    Likes Received:
    28
    Location:
    Santa Clara
    Ok - I'm now totally confused as to how you are trying to view G70 performance per fragment pipe in ALU-limited cases - are you viewing it as 24 pipelines with one ALU per pipeline (total 24 ALUs) or 24 pipelines with two ALUs (48 ALUs total)? If we claim that G70 basically has two 'full' instances of an ALU per pipeline then it would seem that you would expect it to behave like a part with 48 total 'ALUs' (whatever those are).

    If you want to frame R580 in the same terms - i.e. that it has 48 fragment pipelines each with one 'ALU' for this comparison - then we can do that -

    Let's look at the Cook-Torrance test with partial precision first -

    R580 performance = 332.4 * 550/650 = 281.3 fps
    Per "ALU" performance for R580 versus G70 = 281.3/226.1 * 100 = 124%

    So X1900's performance (per ALU) seems to be about 24% faster than G70

    Now with full precision:

    R580 versus G70 = 281.3/162.3 * 100 = 173.3%

    So X1900's performance (per ALU) seems to be about 73% faster than G70.
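    Since both parts are counted as 48 'ALUs' in this framing, the only normalisation left is the clock ratio. A sketch reproducing the two figures above (the function name is mine):

```python
# Scale R580's Cook-Torrance result to G70's 550 MHz clock (no pipe/ALU
# scaling needed when both are counted as 48 'ALUs'), then compare.
def r580_per_alu_vs_g70(r580_fps, g70_fps, r580_clock=650.0, g70_clock=550.0):
    scaled = r580_fps * g70_clock / r580_clock
    return scaled, scaled / g70_fps * 100.0

scaled, pct = r580_per_alu_vs_g70(332.4, 226.1)  # partial precision
print(f"{scaled:.1f} fps -> {pct:.0f}%")         # 281.3 fps -> 124%
_, pct_fp = r580_per_alu_vs_g70(332.4, 162.3)    # full precision
print(f"{pct_fp:.1f}%")                          # 173.3%
```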

    I need to check through this later - I'm doing this in a hurry so I can't guarantee I haven't made some mistakes.
     
  14. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    My apologies - the 2-2.5x factor was comparing final numbers, not per-clock ones. I indeed should have made this clearer, considering I was talking of architectures in the last few lines, then suddenly jumped to comparing final chips.

    I'm interested in getting a better understanding of the performance though, rather than just saying "overall, it's the same fucking thing", even though I roughly agree with that sentiment. Perhaps I'm trying to look a bit too much into the details, but last I heard, this is what B3D was all about - not stopping at the whole "zomg it got 48 pipelines!" thing. Ah well.

    If anything, however, the 1.7x per-clock above for G70 vs R580 is pretty much exactly what I explained above, as this is the "purely arithmetic" case where ATI's texture units are idle and NVIDIA's ALUs aren't used for addressing. My point above was that if you had a ratio where ATI's unit usage was maximal, ATI will most likely win by a larger factor than in the "purely arithmetic" case, no matter how unintuitive it might seem - and cases nearer that situation than the "tex-bound" or "full-alu" situations are perfectly possible in shader-bound games.


    Uttar
     
  15. overclocked

    Veteran

    Joined:
    Oct 25, 2002
    Messages:
    1,317
    Likes Received:
    6
    Location:
    Sweden
    I guess the free FP16 normalize really comes in very handy on G7x.
    Btw, is it only up to the coder to give the _pp hints, or is the compiler smart enough to decide on its own?
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I'm trying to view per-fragment arithmetic rate, with the ALU structure treated as a black box.

    R5xx and G70 have such differing ALU structures within their pipelines that it's not possible to say that an FP32 MAD in both executes in the same time, for example. An "ATI-FLOP" in one pipeline isn't equal to an "NV-FLOP" in the other. (You could argue that 1 ATI-FLOP is about 1.21 NV-FLOP if you wanted to appease Jaws.)

    Since NV40 and R420 appeared, we've known that "per fragment, per clock" the significantly more complex ALU architecture of the NVidia "superscalar" design gives it an advantage, particularly with relatively short shaders or with _PP. Even if it also runs at lower overall utilisation than the competing ATI architecture, the "peaks" of issuable-complexity in shaders allow it to claw back that lost utilisation. I dare say those peaks are quite frequent in most of today's games, with games like Far Cry and FEAR apparently being exceptions.

    In other words, it's rare that an NV40 or G70 pipeline acts like merely a single MAD-capable (3+1) architecture. The second ALU is genuinely making a reasonably significant difference (muddied, somewhat, by the flexibility of each ALU, e.g. 2+2). Obviously NVidia has a get-out-of-jail-free card, with so much code producing acceptable results in _PP, so the register bandwidth limitation doesn't hurt with current games.

    The Cook-Torrance test genuinely seems to be the most arithmetic-intensive synthetic around, so the results of R520 versus G70 (R520 performing at 91% of G70) seems like a fair reflection, particularly as G70 is losing so much performance in full precision.

    In R580, instruction decode, register fetch/store and the render back-end are all ganged together (as "quads"). In that sense it's definitely a four-quad architecture, like R520 (but with 12 fragments per "quad-pipeline" instead of 4).

    If R580 is a 16-pipeline GPU (four "quads" of 12), then that makes NV40 a 4 pipeline GPU (one "quad" of 16), as all 16 fragments being shaded in NV40 have identical shader state (even if they're on different triangles). By the same argument, Xenos is a 3 pipeline GPU (three arrays of 16). In strict architectural terms, these descriptions hold sway and I won't argue with them.

    But they entirely obfuscate the matter under discussion, per-fragment arithmetic rate.

    Clearly R580 is a monster. I'm not arguing it isn't.

    Jawed
     
  17. andypski

    Regular

    Joined:
    May 20, 2002
    Messages:
    584
    Likes Received:
    28
    Location:
    Santa Clara
    Ok - now I understand what you're doing, although I'm still not entirely sure how useful comparing things in such a way actually is.

    Thanks
    - Andy.
     
  18. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Even if it's not a rather useful comparison, it's still a quite interesting one.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Yeah, any single parameter, analysed to death, seems only capable of misleading.

    You guys with your GPU simulators are the lucky ones :!: I'd love to know how a 2:1 R580 (instead of 3:1) would have performed in games - I suspect it would have been practically identical to a 3:1 R580.

    I dare say we'll be waiting a long time before any games really stretch the 3:1 ratio.

    Jawed
     
  20. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    As was mentioned before, the R580 shader core is still a 16-pipe arrangement, just filled with more ALUs per quad in-line, so do you think this would be a correct representation of the case?
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.