What can be defined as an ALU exactly?

Discussion in 'Architecture and Products' started by Ailuros, Feb 17, 2006.

  1. ChrisRay

    ChrisRay R.I.P. 1983-
    Veteran

    Joined:
    Nov 25, 2002
    Messages:
    2,234
    Likes Received:
    26
    But Mintmaster, the second ALU already existed on the NV4x as well. The addition was the MADD capability in the ALU. The pipeline change from the NV4x was not that radical, so you shouldn't expect 2x the performance.
     
    #21 ChrisRay, Feb 18, 2006
    Last edited by a moderator: Feb 18, 2006
  2. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Another gentleman from 3DCenter named Coda measured a worst-case scenario of 20%, but that's not small either. Demirug would definitely have more data on it; my memory of the conversation is pretty vague. If someone wanted to stall it, it shouldn't be too hard IMHO; the question is whether such code would behave that much better on competing products.
     
  3. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania

    http://www.digit-life.com/articles2/video/r580-part2.html
     
  4. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Yes, I saw that. What's your point? The G70 scores are for 550MHz, 24 shader pipes and 24 texture units.

    In the Cook-Torrance lighting test, G70's shader pipe is 10% faster than R520's.
    In the 3-light Blinn test, G70's shader pipe is 32% faster than R520's.
    In the parallax mapping test, G70's shader pipe is 5% faster than R520's.
    In the frozen glass shader test, G70's shader pipe is 12% slower than R520's.
    Steep parallax mapping and fur (PS3.0) were blowouts, naturally.
    (All scores are 32-bit. These are the toughest shaders they threw at the cards. The first three show >2x boost on R580 over R520)

    Clearly the differences are much closer to zero than 100% or even 50%. Counting MADD issue rate is a much poorer gauge of arithmetic performance than multiplying shader pipelines by frequency.
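The per-pipe comparison above amounts to a simple normalization: divide each benchmark score by shader pipes times clock, then compare the results. A minimal Python sketch follows. The scores are made-up placeholders, not numbers from the digit-life article, and the R520 configuration (16 pipes at 625 MHz) is an assumption, since the post only gives the G70 figures (24 pipes, 550 MHz).

```python
def per_pipe_clock(score, pipes, mhz):
    """Benchmark score normalized by shader pipes times core clock."""
    return score / (pipes * mhz)


def advantage_percent(a, b):
    """Percentage by which normalized score a exceeds normalized score b."""
    return (a / b - 1.0) * 100.0


# Placeholder scores (hypothetical); G70 configuration as stated in the post.
g70 = per_pipe_clock(score=100.0, pipes=24, mhz=550)
# R520 configuration is an assumption, not taken from the post.
r520 = per_pipe_clock(score=70.0, pipes=16, mhz=625)

print(f"G70 per-pipe, per-clock advantage: {advantage_percent(g70, r520):+.1f}%")
```

A negative result would mean the second chip's pipe does more per clock, as in the frozen glass case.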
     
  5. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I know. That's why comparing MADD rate is rather silly.
     
  6. ChrisRay

    ChrisRay R.I.P. 1983-
    Veteran

    Joined:
    Nov 25, 2002
    Messages:
    2,234
    Likes Received:
    26
    Then how do you draw the conclusion that the second ALU doesn't do very much? The additional MADD capabilities were an improvement to the second ALU, but it's always been there. I don't see how you have come to the conclusion that Nvidia's second ALU isn't doing much (or I just misread what you were saying), as the shader throughput of the NV4x and G70 has always been fairly exceptional IMO. And the G70 is just more efficient in many cases due to an improvement to the primary ALU.
     
    #26 ChrisRay, Feb 19, 2006
    Last edited by a moderator: Feb 19, 2006
  7. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    What improvement?
     
  8. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Yeah, that's true; one of the fragment shaders/ALUs handles texture processing instructions per pixel processor, but since there are 24 pixel processors on G70 and they're counting units based on the total number of pixel processors available, the statement should read:
     
    #28 Luminescent, Feb 19, 2006
    Last edited by a moderator: Feb 19, 2006
  9. ChrisRay

    ChrisRay R.I.P. 1983-
    Veteran

    Joined:
    Nov 25, 2002
    Messages:
    2,234
    Likes Received:
    26

    The MADD capabilities :)
     
  10. Moloch

    Moloch God of Wicked Games
    Veteran

    Joined:
    Jun 20, 2002
    Messages:
    2,981
    Likes Received:
    72
    How dare you :lol:
    Oh, and you should know better than to say the G70 has 48 ALUs, since you could say the R580 has 96 :D
    Since I was talking about full ALUs and you knew that, or should have, you should not have an issue with me saying the G70 has 24.
     
    #30 Moloch, Feb 19, 2006
    Last edited by a moderator: Feb 19, 2006
  11. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    My point since my initial post was that counting ALUs (depending on perspective) the way most do is nonsense without analyzing the deeper aspects of each pipeline, and yes, I know it didn't come across as clearly as I would have wanted.

    I myself admit to having stepped into the trap initially of not counting ADDs on R5x0, until I got several reminders about it and sat down with a friend who analyzed it for me.

    G70 was/is definitely no slouch in terms of floating point performance especially for the timeframe it was released.

    Since the cutout of this thread doesn't include the initial comment I reacted to, here once more:

    Bon appétit, if you look at architectures in that fashion :roll:
     
  12. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    You're not suggesting that those latter 24 cannot do anything besides handling texture OPs in parallel, are you?
     
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Oh.
     
    #33 trinibwoy, Feb 19, 2006
    Last edited by a moderator: Feb 19, 2006
  14. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Small reminder: MADD+MADD is very unlikely to ever happen on G70 because of register file restrictions. The original NV40 was (MUL or TEX) + MAD. The primary advantage of G70's MAD+MAD is being able to do single-cycle LERPs like ATI, although only when there's no texturing (SUB&MAD), and having a lot more flexibility when it comes to instruction reordering and Vec2+Vec2/Vec3+Vec1 optimizations (part of the advantages in the G70 pipeline for that last point are, however, afaik unrelated).
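To illustrate the LERP point: a linear interpolation a + t*(b - a) splits into one SUB followed by one MAD, which is presumably what "SUB&MAD" refers to. A minimal Python sketch on 4-component vectors follows. The function names are hypothetical, and this only shows the arithmetic decomposition, not how the hardware actually schedules the two operations.

```python
def sub(b, a):
    """Component-wise SUB: b - a."""
    return [bi - ai for bi, ai in zip(b, a)]


def mad(x, y, z):
    """Component-wise MAD: x * y + z."""
    return [xi * yi + zi for xi, yi, zi in zip(x, y, z)]


def lerp_sub_mad(a, b, t):
    """LERP built from one SUB and one MAD: a + t * (b - a)."""
    d = sub(b, a)                    # SUB: d = b - a
    return mad([t] * len(d), d, a)   # MAD: t * d + a


print(lerp_sub_mad([0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0], 0.5))
```

On a pipe whose two ALUs can each issue one of these operations, the pair could in principle complete in a single pass, which is the advantage described above.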

    Personally I would tend to believe that in 3:1 ALU:TEX ratio games, it is a reasonable estimate to say that one of NVIDIA's 24 PS pipelines is equivalent to one of ATI's 48 PS pipelines. This is because NVIDIA's pipelines can do VERY slightly more per clock, and you can roughly imagine the texturing operation every 3 clocks wasting that back.

    Now, on the other hand, if you decrease the ALU:TEX ratio, NVIDIA's texturing abilities increase while their arithmetic ones decrease, which gives them an obvious advantage. So below that 3:1, you'd conceptualize each of NVIDIA's pipelines as doing more and more than ATI's "pipelines", up to the theoretical limit of pure texturing (0:1), where it'd become a (24/16) performance ratio between NVIDIA and ATI (DX7-era games, and some DX8-era ones).

    Now, what's more interesting is what happens when the ALU:TEX ratio goes beyond 3:1. Interestingly enough, NVIDIA's ALU1 gets asked to do texture addressing less and less, so their arithmetic power per pipeline begins to surpass ATI's by more. Obviously, they won't reach the equivalent of ATI's 48 pipelines, but perhaps 28-30 quite easily. Which obviously is why NVIDIA doesn't get beaten by 2-2.5x in purely arithmetic tests. Obviously, 3:1 is NVIDIA's weakness, but it gets less dramatic not only below that ratio, but also above it.
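The shape of this argument can be sketched with a toy bottleneck model in Python. Everything here is a simplification under stated assumptions: the unit counts (24 pipes; 48 ALUs and 16 texture units) come from the thread, but the `co_issue` fraction is a made-up value, picked only so that the pure-arithmetic case lands near the "28-30" equivalent pipelines mentioned above (24 * 1.2 = 28.8). It is not a measured figure and says nothing definitive about either chip.

```python
def r580_clocks(alu_ops, tex_ops, alu_units=48, tex_units=16):
    """Chip clocks per pixel when ALUs and texture units run decoupled."""
    return max(alu_ops / alu_units, tex_ops / tex_units)


def g70_clocks(alu_ops, tex_ops, pipes=24, co_issue=0.2):
    """Chip clocks per pixel for 24 pipes with two ALUs each.

    ALU2 retires one arithmetic op per clock. ALU1 either handles a
    texture op or, on a fraction `co_issue` of the remaining clocks,
    a second arithmetic op. `co_issue` is a hypothetical tuning value.
    """
    per_pipe = max(tex_ops, (alu_ops + co_issue * tex_ops) / (1 + co_issue))
    return per_pipe / pipes


def g70_vs_r580(alu_ops, tex_ops):
    """Relative throughput; above 1.0 means the G70-like chip is faster."""
    return r580_clocks(alu_ops, tex_ops) / g70_clocks(alu_ops, tex_ops)


for alu, tex in [(0, 1), (1, 1), (3, 1), (9, 1), (1, 0)]:
    print(f"ALU:TEX {alu}:{tex} -> G70/R580 = {g70_vs_r580(alu, tex):.2f}")
```

In this toy model the ratio bottoms out at 3:1 and recovers on both sides, which matches the shape of the argument, but the absolute values depend entirely on the assumed `co_issue`.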

    As for what happened to the G71 then, and G73 at the same time: do consider the fact that G70's scheduler was changed, compared to NV40's, in order to divide the batches between PS pipelines... Now obviously, that cost them transistors, and the only use of it we can see right now is slightly less disastrous dynamic branching performance. But think about what flexibility that gives you when it comes to future designs...

    Sadly, I haven't had any reliable confirmation of this train of thought, so just take it as speculation for the time being.


    Uttar
     
  15. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    No, I'm not suggesting this. SU0 can perform a 4 component DIV or MADD and other combinations when there are no tex ops in a given clock; however it is limited during a tex instruction (perhaps due to the fact that the texture fetch unit has limited connections to the dispatch unit and must use the data pathways of the MADD units), although the MADDs and SFU might be able to modify the texture data before it's sent to the texture fetch unit (assuming my previous assumption about the reason for SU0's limitation during tex ops is correct).

    Perhaps a more accurate rendition of the statement would be:
    The statement does not mention that the 24 are completely limited to tex ops, since it is presupposed that they can go beyond tex ops (i.e., if all 48 have the same capabilities, albeit for tex modification, and are fragment shaders, they can go beyond tex ops). In addition, it doesn't specify that the 24 of them are required for tex ops all the time.
     
    #35 Luminescent, Feb 19, 2006
    Last edited by a moderator: Feb 19, 2006
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I disagree. While the four FP32s restriction does make dual-issued FP32 MADDs pretty unlikely (two operands need to be shared by both MADDs), partial precision MADDs seem fairly viable (there are plenty of them in 3DMk06) and won't bust the register bandwidth limit even when all operands are different.
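The register bandwidth argument can be written down as a small counting check in Python. This is only a sketch of the reasoning in this post: the budget of four FP32 reads per clock is taken from the text, while the rule that a partial-precision (FP16) operand costs half an FP32 read is an assumption made for illustration, and the function names are hypothetical.

```python
FP32_READ_BUDGET = 4.0  # "four FP32s" per clock, as described in the post


def read_cost(operands, partial_precision):
    """Register read cost of the distinct source operands, in FP32 units.

    Assumption: an FP16 (_PP) operand costs half of an FP32 operand.
    """
    width = 0.5 if partial_precision else 1.0
    return len(set(operands)) * width


def can_dual_issue(madd_a, madd_b, partial_precision=False):
    """True if two MADDs (three sources each) fit in the read budget."""
    return read_cost(madd_a + madd_b, partial_precision) <= FP32_READ_BUDGET


# FP32, all six operands different: over budget.
print(can_dual_issue(("r0", "r1", "r2"), ("r3", "r4", "r5")))
# FP32, two operands shared between the MADDs: fits.
print(can_dual_issue(("r0", "r1", "r2"), ("r0", "r1", "r3")))
# Partial precision, all six operands different: fits.
print(can_dual_issue(("r0", "r1", "r2"), ("r3", "r4", "r5"), partial_precision=True))
```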

    2+2 and 3+1 are features of NV40, too, so it's a fairly subtle advantage in this respect for G7x.

    I tend to agree, but NVidia's architectures seem to be more sensitive to overall register count - as long as the register count is no more than about 4 or 5 FP32s then they're OK. So they become very much dependent on being able to use _PP to maintain performance. Which seems viable as shorter shaders prolly won't reveal FP16-precision errors.

    Yep, I think this is where the heavy advantage for NV40 and G70 fragment pipelines comes from, with so few games having much arithmetic intensity.

    Generally I agree - the NVidia pipeline appears "more flexible", able to gracefully trade texturing and ALU proportions. But I think the true cost transpires in heavy register (and/or FP32 precision) usage.

    The only other thing that's worth noting is that the 3:1 ALU:TEX thing has become a little muddled, as far as I can tell. ATI was recommending 3:1 for R420. To me this means that R580 needs about 9:1 to flourish. The 3:1 ratio in R420 seems to be a function of the latency-hiding capability of the fragment pipeline (i.e. thread size), with the partially decoupled texturing providing a fair degree of texture "pre-fetching", though limited by R420's "stalling" upon dependent texturing. With fairly intensive texturing in most games, I think it's fair to say R420 prolly never saw much in the way of 3:1 until, ahem, after R520 had released, and so analysis of this point in respect of R420 hasn't happened...

    Well, that's my interpretation, anyway.

    Truly intense arithmetic tests seem to be all over the shop:

    http://www.digit-life.com/articles2/video/3dmark06/3dmark06_11.html

    which shows a 35% advantage per fragment pipe for G7x.

    The two PS3 tests (Steep Parallax Mapping and Fur) show the opposite, though:

    http://www.digit-life.com/articles2/video/r580-part2.html

    27% and 18% advantage per pipe in favour of R580 - but they prolly make use of dynamic branching as a performance tweak.

    The PS2 tests on that page, Parallax Mapping and Frozen glass show a heavy dependency on _PP for G70. In FP32, though, the former shows a 35% advantage for G70 while the latter shows a 79% advantage.

    (7800GTX-512 assumed to be 550MHz and R580 assumed to be 650MHz.)

    Jawed
     
  17. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
  18. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    215
    Location:
    Uffda-land
    Won't the limitation of the ROPs kick in at some point in that scenario tho?
     
  19. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    Only in cases where you have only one texture layer and need only the bilinear filter. But in these cases you will have a bandwidth problem too.
     
  20. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    No, not when multitexturing/using better filtering. I'd rather fear the memory bandwidth personally then, though (unless the textures are sufficiently low-res & compressed, and the resolution is sufficiently high to improve framebuffer compression - both of which are in fact very likely when playing an old game on a new high-end card).

    Uttar
    EDIT: Damn, Demirug beat me to it ;)
     