What can be defined as an ALU exactly?

Discussion in 'Architecture and Products' started by Ailuros, Feb 17, 2006.

  1. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    Nope. At 500 MHz with standard-cell logic and FP32 precision, even a single DOT3 alone is going to take 6 to 8 pipeline stages, not counting the steps associated with getting instructions/data in and out of the DOT3 execution unit. Texturing is about an order of magnitude worse, somewhat depending on the mechanism used to provide latency tolerance.
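
    To put the DOT3 itself in concrete terms, here is a minimal C sketch of the arithmetic such a unit performs (the type and function names are purely illustrative, not from any real driver or hardware). The 6-8 stages come from pipelining each FP32 multiply and add over several cycles of operand alignment, normalization and rounding:

    ```c
    #include <stdio.h>

    /* Minimal sketch of the arithmetic a DOT3 unit performs; names are
     * illustrative, not from any real hardware. In hardware the three
     * FP32 multiplies run in parallel and feed a two-level adder tree;
     * pipelining each multiply/add over several cycles at 500 MHz in
     * standard cells is where the 6-8 stages come from. */
    typedef struct { float x, y, z; } vec3;

    static float dot3(vec3 a, vec3 b)
    {
        /* stage group 1: three independent FP32 multiplies */
        float px = a.x * b.x, py = a.y * b.y, pz = a.z * b.z;
        /* stage group 2: two dependent FP32 adds (the adder tree) */
        return (px + py) + pz;
    }

    int main(void)
    {
        vec3 n = { 0.0f, 0.0f, 1.0f }, l = { 0.6f, 0.0f, 0.8f };
        printf("N.L = %f\n", dot3(n, l)); /* 0.8 */
        return 0;
    }
    ```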
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    • Look at midrange cards, e.g. RV410 versus NV43
    • At 90nm R580 and G71 are going to be within 10% of each other
    You've got me, I have no idea what Parhelia is - literally. Have you read Andy's posts?

    A problem arises because Andy described R580 as a 16-pipeline part with 12 fragments being shaded per "quad". That's effectively "per TMU quad".

    Because texturing is intrinsically a quad-based operation (as near as dammit) it makes sense for the TMUs to be arranged in quads, and since NV40 has tightly coupled ALUs and TMUs, everything goes together as four sets of quads.

    In NV40/G70 it's possible to put multiple triangles into a thread, and therefore there's very low risk of any single pipeline being "unused".

    R3xx...R4xx...R5xx appear to work strictly on one triangle per thread so small triangles do hurt these architectures somewhat. I'm guessing this is why a thread in these architectures (256 fragments) is smaller than in NV40 (4096) and G70 (1024).

    Well I tend to agree :smile:

    But in pure engineering terms, the ganging of fragment processing into quads and/or arrays (e.g. Xenos) as I've been describing means there are far fewer actual pipelines than marketing would have you believe.

    Jawed
     
  3. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    The Parhelia was Matrox's last high-end product. Released not long before the 9700 Pro, it was the last of the high-end DX8 products. Its primary claim to fame was its support for three displays (triple-head gaming). With some games, you could make use of the three-display output for a panoramic view. It also supported "Fragment Anti-Aliasing", which was a method of selective supersampling where the card would only supersample those parts of the frame that were on the edges of objects (not all triangle edges: it attempted to separate out those which would cause aliasing, such as the silhouette of a mesh). The supersampling was 16x ordered-grid, and its performance hit was roughly on par with that of the GeForce4 Ti's 4x multisampling.

    Its performance was subpar (for the price and time), though, and so it was overshadowed when the DX9 cards started to show up.

    If I remember correctly, it supported a large number of texture ops per pipeline, something like 3-4.
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    A DP4 in Xenon takes 14 stages at 3.2GHz. At 600MHz I'd expect it to take way less than half the number of stages.

    And even with "long instructions" such as DP or RSQ, simpler ADD or MUL should run in 1 cycle.

    But that's a separate pipeline with its own startup cost (i.e. fetch from memory or L2 as required) - and with texture data in cache a bilinear operation is supposed to be 1 cycle.

    I was focussing on the NV40/G70 ALU pipeline and pointing out that fragment instructions are possibly staggered across the two shader units, rather than dual-issue occurring on a single fragment on both shader units.

    ---

    As a matter of interest, Xenon at 3.2GHz has a ~610 cycle L2 miss penalty with 700MHz GDDR3 as the memory (via Xenos's northbridge, i.e. extra delay). At 600MHz, that's about 115 clock cycles, which is about half the thread duration (256 clocks) for one instruction in G70.
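
    As a sanity check on that conversion, a back-of-the-envelope sketch (the 610-cycle penalty and the two clock speeds are the figures quoted above; everything else is unit conversion):

    ```c
    #include <stdio.h>

    /* Back-of-the-envelope version of the scaling above. The ~610-cycle
     * L2 miss penalty at 3.2 GHz is the figure quoted in the post; the
     * absolute latency in nanoseconds is set by the memory system, so
     * it simply rescales to the 600 MHz GPU clock. */
    int main(void)
    {
        double cpu_hz = 3.2e9, gpu_hz = 600e6;
        double miss_cycles_cpu = 610.0;

        double miss_ns         = miss_cycles_cpu / cpu_hz * 1e9;  /* ~191 ns */
        double miss_cycles_gpu = miss_ns * 1e-9 * gpu_hz;         /* ~114 cycles */

        printf("~%.0f ns -> ~%.0f GPU clocks (vs a 256-clock thread)\n",
               miss_ns, miss_cycles_gpu);
        return 0;
    }
    ```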

    Jawed
     
  5. snk

    snk
    Newcomer

    Joined:
    Aug 10, 2002
    Messages:
    53
    Likes Received:
    2
    Location:
    Finland
    In traditional terms Parhelia was a 4x4 architecture, whereas NV30 was 4x2 and R300 was 8x1. Its performance was so low mainly because it lacked any Z-occlusion culling capabilities.
     
  6. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,429
    Likes Received:
    181
    Location:
    Chania
    My memory is very vague on Parhelia, but didn't it have a TMU-array?
     
  7. stepz

    Newcomer

    Joined:
    Dec 11, 2003
    Messages:
    66
    Likes Received:
    3
    Outputted to where? I thought all of the above can output up to 16 fragments into the framebuffer. :)
    Anyway, I think I understand what you mean, but that number seems pretty much meaningless, as it describes neither the architecture nor the performance. All this nonsense of ALU A being 1.5x ALU B seems as pointless as comparing the MIPS or IPC numbers of modern processors. There plain and simple isn't an IPC without qualifications. Either you compare performance (per chip or per wall clock) running one shader or another, or you try to specify the architecture in full detail. There just plain isn't any other relevant way to compare such differing architectures.
     
    Ailuros likes this.
  8. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,305
    Likes Received:
    138
    Location:
    On the path to wisdom
    The "texturing pipeline" isn't separate in NV3x/NV4x/G7x. And bilinear filtering/sampling has 1 quad/cycle throughput, but it takes several cycles.

    It's a pipeline, so different stages are working on different quads.
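
    A toy illustration of that throughput-versus-latency distinction (the 8-stage depth is an assumed number purely for the sketch, not a documented NV4x figure):

    ```c
    #include <stdio.h>

    /* Throughput vs. latency for a pipelined filtering unit, as
     * described above: even if one bilinear result takes several cycles
     * end to end, the unit retires one quad per cycle once full,
     * because every stage holds a different quad. The 8-stage depth is
     * an assumption for the sketch, not a documented NV4x figure. */
    int main(void)
    {
        const int stages = 8;    /* assumed pipeline depth (illustrative) */
        const int quads  = 100;  /* quads pushed through the unit */

        int total_cycles = stages + (quads - 1);  /* fill once, then 1 quad/cycle */
        printf("%d quads: %d cycles (latency %d, throughput %.2f quads/cycle)\n",
               quads, total_cycles, stages, (double)quads / total_cycles);
        return 0;
    }
    ```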
     
  12. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,503
    Likes Received:
    420
    Location:
    Varna, Bulgaria
    A bit of a side question about the tex caches in NV40: is it true that the L1 tex cache is shared within a single quad, the way the L2 is shared by all of the quads?
     
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,537
    Likes Received:
    589
    Location:
    New York
    Oh, didn't know you worked for ATi :smile: And no, that's not what I'm doing at all. My 2 for G70 and 1.5 for R580 is down to the commonly referenced simplified MADD+MADD and MADD+ADD capabilities of each respective shader, notwithstanding architectural differences.

    What you've outlined above is exactly why per-shader performance is a much better metric than per-ALU performance since the numbers aren't obfuscated by the intricate architectural details.

    What you've still not addressed is your justification for considering R580 a 16-shader part, while at the same time considering G70 a 24-shader part. The only fair comparisons I can see are 6/12 (quads), 24/48 (shaders) or 48/48 (ALUs) (or 48/96 if we count the ADD) for G70/R580.
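
    To put those counting conventions side by side, here's a small sketch that just tabulates the 6/12, 24/48, 48/48 and 48/96 comparisons above (no new figures, only a restatement):

    ```c
    #include <stdio.h>

    /* The counting conventions from the paragraph above, side by side.
     * No new figures: these rows just restate the 6/12 (quads), 24/48
     * (shaders), 48/48 (MADD ALUs) and 48/96 (counting the extra ADD)
     * comparisons for G70/R580. */
    struct convention { const char *unit; int g70; int r580; };

    int main(void)
    {
        struct convention c[] = {
            { "quads",            6, 12 },
            { "shaders",         24, 48 },
            { "MADD ALUs",       48, 48 },
            { "ALUs incl. ADD",  48, 96 },
        };
        for (int i = 0; i < 4; i++)
            printf("%-14s  G70: %2d  R580: %2d  (R580/G70 = %.1fx)\n",
                   c[i].unit, c[i].g70, c[i].r580,
                   (double)c[i].r580 / c[i].g70);
        return 0;
    }
    ```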

    And if you want to get into the discounting game, considering R580 a 48 ALU part still discounts the ADD of the first ALU. I think that's a very generous trade for Nvidia's mini-ALU.
     
    #93 trinibwoy, Feb 21, 2006
    Last edited by a moderator: Feb 21, 2006
  14. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    1. CPUs are not usually designed with standard cell libraries.
    2. CPUs are optimized almost exclusively for (effective) latency of operations, and clock speed.

    A lot more transistors are being used to implement that DP4 at 3.2 GHz than a DP4 on a GPU. A high-end GPU would also have significantly more capability to compute DP4s than Xenon.


    According to this diagram, yes
     
  15. JF_Aidan_Pryde

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    601
    Likes Received:
    3
    Location:
    New York
    Okay, but surely you can agree that by measuring math/second, we're measuring throughput, which is every bit as important as math/clock. And when there's a large clock difference, it's especially important.

    A four pipeline card with four TMUs per pipe. It was cool while it lasted. :)

    I re-read Andy's post. He's basically dividing up the pipes by texturing units because of the odd numbers present in the R580. But I don't see how, based on that, you can say that the NV40 is a one-pipeline card. It has the same number of shaders as texture units. So it's just a 16-pipe card.

    Can you provide more details? Don't texture units already fetch four samples at full speed? How does ganging four texture units together help?

    Do the triangles in the thread have to be physically adjacent to each other?

    Only if you define the GPU in terms of the number of shader states. And in terms of ganging pipes, the fragment rate is only reduced if the ganging is serial; so long as there are 'n' shader units all outputting fragments in parallel, I think it's totally valid to describe the G70 as a 24 fragment-pipe part and the R580 as a 48 fragment-pipe part. Of course the contents of the pipeline deserve separate discussion. :)
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    [posting this despite reservations about pissing into the wind]

    As far as I can tell that's simply indicating that the first shader unit is either doing shader arithmetic or it's calculating dependent texture addressing.

    The fundamental issue here is that if the TMU is truly in-line and the resulting pipeline was dozens if not hundreds of clocks long, then there'd be no need for threads (or they could consist of a few 10s of fragments, not hundreds as they actually do).

    As I showed earlier, typical GDDR3 fetch latency is easily hidden solely by per-quad-pipe thread size (~115 cycles of latency with 700MHz GDDR3 on a 600MHz GPU is easily hidden by a 256-cycle-per-instruction thread).

    Additionally we can clearly see in R3xx etc. that the semi-decoupled texturing of that architecture requires a specific texture address calculation ALU (which people continue to forget to count when "counting pipeline ALUs") which then feeds a texturing pipeline. So texturing proceeds asynchronously.

    [diagram: R3xx pixel shader pipeline]

    Xenos, of course, is the model of fully decoupled texturing, but R5xx also achieves the same in the context of pixel shading:

    [diagram: R5xx decoupled shader core and texture units]

    where the thread in the shader core will often not be the same as the thread in the corresponding texture unit.

    All that's happening in NV40/G70 is that there is no dedicated ALU for dependent texturing address calculations (so shader unit 1 is overloaded) - texturing itself proceeds asynchronously, with typical thread sizes enabling the texture pipe to produce its results before the fragment returns to context and needs that result, 1, 2 or more instructions later. With bilinear filtering that will normally be the following instruction, a minimum of 256 cycles after the texture operation is commenced.
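
    A toy model of that hiding mechanism (emphatically not the actual NV40/G70 scheduler; the 256-clock-per-instruction thread and ~115-cycle fetch latency are the figures discussed above):

    ```c
    #include <stdio.h>

    /* Toy model of the latency hiding described above -- not the real
     * NV40/G70 scheduler. A quad pipe works through the whole thread
     * before returning to any given fragment, so a texture fetch issued
     * for that fragment has a full thread's worth of clocks to complete
     * before the next instruction consumes its result. */
    int main(void)
    {
        const int thread_cycles = 256;  /* clocks per instruction across the thread (figure above) */
        const int fetch_latency = 115;  /* GPU clocks for a GDDR3 fetch (estimate above) */

        int issue_clock   = 0;                            /* fetch issued here */
        int ready_clock   = issue_clock + fetch_latency;  /* result lands here */
        int consume_clock = issue_clock + thread_cycles;  /* same fragment's next instruction */

        printf("fetch ready at clock %d, consumed at clock %d -> %s\n",
               ready_clock, consume_clock,
               consume_clock >= ready_clock ? "latency hidden" : "stall");
        return 0;
    }
    ```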

    Jawed

    EDIT: removed the bit about 32-fold versus 8-fold - sigh, brain attack...
     
    #96 Jawed, Feb 21, 2006
    Last edited by a moderator: Feb 21, 2006
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I think that 14-stage DP4 pipeline might actually be so long because it's an SMT architecture - so it could arguably be half that length :smile:

    A MADD in Xenon is 12 stages, as compared with 6 in Cell SPE (also at 3.2GHz). Maybe I should have used Cell SPE's vector pipeline for the comparison :oops:

    Jawed
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    That might be what you are interested in, but it's not what I was interested in.

    Ah well, never mind, it's not an interesting perspective if you're not interested by the architecture, per se.

    Prolly a semantic thing, I'm simply saying that four texels together is the natural order of things:

    http://www.3dcenter.org/artikel/nv40_technik/index3_e.php

    Dunno! Maybe Bob will say. I doubt they do, since it's really about shader state.

    Jawed
     
  19. andypski

    Regular

    Joined:
    May 20, 2002
    Messages:
    584
    Likes Received:
    28
    Location:
    Santa Clara
    I believe I was originally looking more at "per-shader" performance rather than per-ALU, but people seemed to want to look at per-ALU for some reason.
    Funny - I thought I had addressed that, but I'll try again.

    We want to compare the respective architectures, and for the purposes of comparison we want to divide them up into chunks called 'shaders', each of which can execute a pixel shader program.

    What do you need to have in order to run a basic pixel shader program?

    - To be able to run a generic pixel shader program you require ALU and texture resources (and potentially flow control etc., but let's keep things relatively simple)

    So a 'unit' to run a pixel shader program needs ALU and texture resources, so I divide the respective architectures up evenly into chunks that meet the criteria and call them 'shaders'. This results in the following divisions -

    R580 - 16 "shaders", each with 1 texture resource and 3 ALU resources
    G70 - 24 "shaders", each with 1 texture/ALU resource and 1 dedicated ALU resource.

    Dividing things in this way doesn't necessarily have anything to do with the underlying architecture - it's just a way to form a basis for comparison. I guess you could actually pick _any_ basis, as long as your assumptions are consistent and correct, and perform a comparison.

    For example, if you want to look at the ALU-only case we could choose to discount the texture resources entirely and say that G70 has 48 ALUs and R580 has 48 ALUs, which is what I did in the earlier post about the Cook-Torrance shader performance.

    The problem then obviously comes back to the more thorny area of the debate which is "what is considered an ALU?". A lot of people seem to feel that it's an individual block that contains a MAD unit, which was how I framed it, but as you correctly point out this is not the only possibility by any means.

    I wasn't playing a discounting game - I was playing a simplifying game. :)

    When I initially started comparing things I was pretty content to work at a more abstract level - take each ALU chunk from each architecture as a black-box and simply compare the apparent execution characteristics on the supplied shaders. I was not really looking for why A was faster or slower than B, or whether A is more expensive in silicon than B, which are also interesting questions to answer.

    Mainly the interest seemed to be in comparing the performance characteristics of the ALUs of the two designs, which is what I did. Examining the exact tradeoffs that went into the different designs and their detailed behaviour would be far more complex. Both architectures have "ALUs" that are more complex and perform more operations than a simple MAD.

    Whether it's a generous trade or not would depend on what the full capabilities of that mini-ALU are (and things like the NRM unit, of course). We could equally say that allowing particular ALUs to run at 16-bit precision and comparing to 32-bit precision is also a generous trade in the opposite direction, and for like-like comparison we should always stick to 32-bit precision only (which is fine by me, by the way...;)).

    I agree that it's just very difficult to do this analysis in a 'fair' manner generally, and I certainly see your point about the additional capabilities on R5xx ALUs, but I just don't see how we can count us as having an extra ADD without counting an extra NRM, for instance. If you want to say that G70 has only 24 ALUs then I guess you can do that, but then you are just reversing the problem, because you are then saying it's fair to equate something like -

    (MAD + miniALU_X + NRM + MAD + miniALU_Y)

    as a functional unit to:

    (miniALU_Z + MAD)

    I guess the one thing that we can say for sure is that any way in which we choose to frame this someone is going to feel hard done by.
     
  20. ERP

    ERP Moderator
    Moderator Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    This seems like a totally pointless metric to me.

    Without understanding co-issue restrictions, the number of register ports, how the ALUs are actually mapped to hardware-specific shader instructions, register usage costs, etc., all you're really doing is measuring overall throughput in a specific test and dividing by some arbitrary number. And as evidenced by this thread, you can't even decide what you should be dividing by.

    You simply cannot isolate the number you are trying to measure without architectural details and tests tailored to test that one element.
     