What can be defined as an ALU exactly?

Discussion in 'Architecture and Products' started by Ailuros, Feb 17, 2006.

  1. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    I don't think they are inline, in the sense that they pass anything from one to another; AFAIK they are completely parallel. For that diagram I would turn them on their sides and get rid of the blue links between them.
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    It seems like a reasonable diagram.

    If I were drawing such a diagram, based upon the R520 die shot:

    [IMG: R520 die shot]

    I'd want to emphasise that physical locality is actually a function of the fragment pipelines, particularly as screen-space tiling is a key concept in R5xx (inherited from R3xx and R4xx).

    To that end, I would group the thread despatch, texture address calculation, texturing, texture cache, GPRs, ROPs, Z/stencil cache and colour buffer cache into blocks (total four). When a triangle is scan-converted, individual fragments have a guaranteed path through the GPU, physically constrained to one of the four primary pipelines.

    Jawed
     
  3. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    But given the known thread-batch size and count on R580 compared to R520, I'm not very convinced by the "fully parallel" arrangement, unless you, mighty Dave, have some trusted internal info. :D

    btw, here is the R520, but as with its R580 counterpart I had trouble placing the memory controller and ring-bus routing; it seems a bit too complex, and it's unclear what the wiring relation with the core sections is, hence its absence from the drawings. Any help?
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
  5. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    Well, apart from the above diagram, which I showed to Eric beforehand to ask his opinion on it, the other element pointing against inline is the shader compiler optimiser, one of the reasons ATI cite for sticking with the same basic "per pixel shader ALU" structure. If they were inline they would have greater dependencies on one another, changing the nature of the compiler optimiser; if they are just multiple pixels issued in parallel, then it doesn't actually need to be changed.

    Also, if they were inline then they would be operating on multiple instructions over the same pixel - ATI have already stated that this is not the case, as the thread sizes increase 3x with R580/RV530, which is consistent with just issuing 3x pixels in parallel.
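    A back-of-the-envelope check on that (my own sketch, using the commonly cited batch figures rather than anything official):

    ```python
    # Hypothetical sketch: how batch (thread) size scales if the extra ALU
    # blocks issue extra pixels in parallel rather than sitting inline.
    # The 16-pixel R520 batch is the commonly cited figure, assumed here.
    r520_batch = 16      # pixels per thread on R520
    alu_blocks = 3       # R580 triples the ALU blocks per "quad" pipeline

    # Inline ALUs would work on more instructions for the *same* pixels,
    # leaving the batch size unchanged; parallel issue needs 3x the pixels
    # in flight, so the batch grows 3x:
    r580_batch = r520_batch * alu_blocks
    print(r580_batch)    # 48
    ```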
     
  6. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    OK, then I'll turn them 90° so the configuration matches the extended batch size (e.g., from a 4x3 to a 3x4 "matrix" placement). ;)
     
  7. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    I do think otherwise, since you are obviously equating a single G70 shader (2 ALUs each) with 3 full R580 shaders (~1.5 ALUs each). Each R580 shader cannot be considered a "single ALU" as you have done in your comparison above. Even if you consider comparing per-shader performance useless, comparing per-ALU performance is even more irrelevant and useless, IMO, especially using your definition of an ALU.
     
    #67 trinibwoy, Feb 20, 2006
    Last edited by a moderator: Feb 20, 2006
  8. JF_Aidan_Pryde

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    601
    Likes Received:
    3
    Location:
    New York
    Without the second ALU, the first ALU would be unavailable when texturing. For a 1:1 ALU:Tex scenario, that would be halving the performance. So the second ALU is definitely needed, although MADD may not be critical.
     
  9. JF_Aidan_Pryde

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    601
    Likes Received:
    3
    Location:
    New York
    I am with you with respect to measuring arithmetic rate but I think it should be measured as per second rather than per clock. If you measure by per clock, NV's design will always come out on top since it does more per clock by design. But this design also means they are clocked lower. ATI does less per clock but is clocked higher. It's all very similar to the ILP vs. clock speed debate with the Pentium 4 and Athlon. So I'd measure math/second as opposed to math per clock.
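    To put rough numbers on that (the pipe counts and clocks below are the reference-board figures as I recall them, so treat this as illustrative only):

    ```python
    # Illustrative per-clock vs per-second comparison. Pipe counts and
    # clocks are approximate reference-board figures, not official specs.
    chips = {
        #            (fragment shader pipes, core MHz)
        "R580":      (48, 650),   # Radeon X1900 XTX
        "G70 512MB": (24, 550),   # GeForce 7800 GTX 512
    }

    for name, (pipes, mhz) in chips.items():
        per_second = pipes * mhz * 1e6   # shader issue slots per second
        print(f"{name}: {pipes} slots/clock, {per_second / 1e9:.1f}G slots/s")

    # Per clock R580 issues 2x as many slots; per second the gap is wider
    # because it also clocks higher (48*650 / (24*550) is roughly 2.36x).
    ```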


    Where does the sixteen-pipeline count come from? Only 16 x 3 = 48, and I can't see a three in the R580 anywhere. I do agree it's four "quads" of 12.

    I don't think I follow. "One quad of 16"? If I understand 'quad' correctly, a 2x2 pixel region of a triangle rendered by four coupled pixel pipes, then the NV40 is surely four 'quads' of four. That it's a SIMD architecture would mean all quads are undergoing the same shader program.
     
  10. andypski

    Regular

    Joined:
    May 20, 2002
    Messages:
    584
    Likes Received:
    28
    Location:
    Santa Clara
    You seem to want to count our ALUs as more than one for some reason, while you seem to be quite happy to simply treat each of nVidia's ALUs as a pure MAD, however I don't see why this is valid at all.

    Here is a link to a page with a slide, apparently from an nVidia presentation, detailing the capabilities of their ALUs -

    http://www.tomshardware.com/2005/06/22/24_pipelines_of_power/page3.html

    So each ALU in G70 apparently has a MAD, one of them also gets to do a normalization in parallel and both of them also have a 'mini' ALU, which can apparently perform some range of tasks (the full details of which I guess are undisclosed, but I expect at least modifiers like 2x, 4x and probably other things).
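    For what it's worth, here is the usual (and admittedly crude) way such an ALU gets flattened into a single flops number - a convention, not anything from the slide:

    ```python
    # Crude per-pipe FLOP counting for a G70 fragment pipe, using the common
    # convention that a vec4 MAD counts as 8 flops. The parallel normalise
    # and the mini-ALUs are exactly the capabilities this count ignores.
    components = 4       # vec4
    mad_flops = 2        # one multiply + one add per component
    alus_per_pipe = 2    # two MAD-capable ALUs per fragment pipe

    flops = alus_per_pipe * components * mad_flops
    print(flops)         # 16 flops per pipe per clock, counting MADs only
    ```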

    Why should they get a pass on all these additional capabilities, while you feel that ours have to be accounted for by some scaling? We don't have a parallel normalizing unit - nVidia chose to spend area there, while I guess we spent it elsewhere - why do you choose to ignore it?

    The ALUs of each company are different, sharing some characteristics and differing in others, reflecting the design decisions we each made. Perhaps each of ours does more per-clock on average than the competition, but why does that mean we should suddenly apply a scaling factor of 1.5 to each of our ALUs? Just to make the performance seem less impressive?

    Maybe you would like to apply a scaling parameter to the performance of Intel CPUs when compared to AMD ones, since apparently they don't do the same amount of work per clock either?
     
  11. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    The last tests with the usual suspects (UT2004, Doom3, Quake4) show little (~1%) benefit when moving from 2:1 to 3:1. The improvement for 2:1 tops out at ~10% (at 1024x768 8xAF) but is quite dependent on other parameters (number of available registers). And, as usual, the disclaimer: the simulator doesn't accurately represent any known or unknown real GPU, and the benchmarked games may not be representative of current Direct3D games (blame game developers for not releasing more OpenGL games :wink: ).
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Except that with the competing architectures clocking to within 10% of each other, I don't think that argument holds much sway.

    You don't see a 3 in R580 :shock: :?: I'm confused with what you're saying and I'm wondering if you've read Andy's posts.

    I put "quad" in quotes deliberately to point up the strange organisation I was describing.

    Yes, NV40 has 4 quads, each quad consists of four ALU and TMU pipes. NV40 only has a single shader state, though, with one instruction in one shader being executed across all 16 fragments.

    Do you define the fragment pipeline count by how many texture operations a GPU can do in parallel, or by the number of shader states it can support concurrently, or by the number of fragments that are in context?

    ---

    As a matter of interest, I think there's a theory that NV40 and G70 actually have two fragments in context at any given time, with fragment A in shader unit 2 and fragment B in shader unit 1. On the next clock, fragment B is in shader unit 2 and fragment C is in shader unit 1. Can't remember where I came across this, though...

    Jawed
     
  13. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    I think it's pipelined much more deeply than that.
     
  14. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    Indeed, but most of that deep pipeline is just for texture latency hiding. The ALUs probably have only a handful of stages each.
     
  15. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Well, except it seems like the first ALU is shared with the texture unit, and thus would seem to require the same amount of latency. About the second one you're probably right, but we're still talking much more than one fragment at a time in the second ALU (I'd guess 4 at a minimum, quite possibly more).
     
  16. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    The first ALU is not "shared", it sits before the TMU. And obviously it's several quads, one in every pipeline stage.
     
  17. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    Yes - the first ALU has the task of channelling the texture coordinates to the attached TMU, so either way it is affected by the latency; but that doesn't mean the MUL units are hogged all the time, so with some smart reordering it's possible to utilise the ALU for math ops in the texture-fetch interims.
     
  18. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Ah, yeah, that's gotta be true. Nevermind on that point.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Since thread sizing is the primary mechanism for hiding the latency of texturing, I think it's quite likely that the total length of 500-700MHz GPU pipelines is under 10 cycles, and could easily be in the region of 5 or 6, including instruction fetch/decode/issue and register fetch. A chunk of that will probably relate to register fetch, as the huge register file in GPUs increases both indirection (banking) and distance on the die.

    Additionally, the driver compiler should mean that instruction decode/issue should proceed very swiftly as the issue complexity can be analysed at compile time.

    Finally, due to threading, instruction fetch/decode/issue is only required irregularly (e.g. once every 256 fragments), so there's little reason to count that in the total execution length of the shader pipeline.
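    A toy model of that amortisation argument (all numbers assumed, purely illustrative):

    ```python
    # Toy latency-hiding model; the latency and batch size are assumptions,
    # not any specific GPU's figures.
    tex_latency = 200    # clocks for a texture fetch round-trip (assumed)
    batch_size = 256     # fragments per thread/batch (assumed)

    # With one fragment issued per clock, a batch keeps the ALU busy for
    # batch_size clocks, hiding the fetch if that covers the latency:
    print(batch_size >= tex_latency)   # True

    # Instruction fetch/decode happens once per batch, so its amortised
    # per-fragment cost is tiny:
    print(1 / batch_size)              # 0.00390625 of a fetch per fragment
    ```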

    So in terms of active pipelining, we're left with register fetch and computation in shader units 1 and 2, with feed-forward of computed results from shader unit 1 to TMU or shader unit 2.

    There's an awful lot written about the NV40 pipeline in:

    http://www.3dcenter.org/artikel/nv40_pipeline/index_e.php

    Jawed
     
  20. JF_Aidan_Pryde

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    601
    Likes Received:
    3
    Location:
    New York
    During the time when G70 and R520 were competing for top spot, the clock disparity was 45%. Even now, between the G70 512MB and R580, it's 18% in ATI's favour.

    I really don't see how dividing up texturing units helps define a pipeline. How would one classify the Parhelia then?

    That's what I mean. But you've described it both as '1 quad of 16' and four quads of four! I'm probably misreading something.

    A question though: if the NV40 has all pipelines executing the same instruction, what's the point of having quad groups of pipelines? I thought the very point of quad groups was that within a group everything executes the same instruction. If all sixteen pipes are doing the same instruction, doesn't that mean each triangle needs to be at least 16 fragments big for the pipeline to be fully utilised? Is the same true of the X800 and G70?

    I thought that to count the number of fragment pipelines, you count the number of fragments that can be output per clock: 16 for the NV40 and R520, 24 for the G70 and 48 for the R580.
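    That definition is simple enough to state as a check (using the figures quoted in this thread):

    ```python
    # Fragment pipeline count defined as fragments output per clock,
    # using the figures quoted above in the thread.
    fragments_per_clock = {
        "NV40": 16,
        "R520": 16,
        "G70":  24,
        "R580": 48,
    }

    # On this definition R580 has three times the fragment pipelines of R520:
    print(fragments_per_clock["R580"] // fragments_per_clock["R520"])   # 3
    ```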
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.