What can be defined as an ALU exactly?

I don't think they are inline, in the sense of passing anything from one to another; AFAIK they are completely parallel. For that diagram I would turn them on their sides and get rid of the blue links between them.
 
It seems like a reasonable diagram.

If I were drawing such a diagram, based upon the R520 die shot:

[Attached image: die.jpg - R520 die shot]


I'd want to emphasise that physical locality is actually a function of the fragment pipelines, particularly as screen-space tiling is a key concept in R5xx (inherited from R3xx and R4xx).

To that end, I would group the thread despatch, texture address calculation, texturing, texture cache, GPRs, ROPs, Z/stencil cache and colour buffer cache into blocks (total four). When a triangle is scan-converted, individual fragments have a guaranteed path through the GPU, physically constrained to one of the four primary pipelines.
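Just to illustrate the kind of fragment-to-pipeline constraint I mean (the tile size and the mapping below are made up for illustration, not the actual R5xx scheme):

```python
# Illustrative sketch only: a fragment's screen-space tile picks one of the
# four fragment pipelines. TILE_SIZE and the mapping are assumptions, not
# the real R5xx tiling.
TILE_SIZE = 16  # assumed tile dimension in pixels

def pipeline_for_fragment(x, y, num_pipes=4):
    tile_x, tile_y = x // TILE_SIZE, y // TILE_SIZE
    return (tile_x + tile_y) % num_pipes

# Every fragment in a given tile gets the same answer, so its path through
# texturing, GPRs and the ROP/cache block is physically fixed.
print(pipeline_for_fragment(5, 7), pipeline_for_fragment(21, 7))
```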

Jawed
 
But given the known thread-batch size and count on R580 compared to R520, I'm not very convinced by the "fully parallel" arrangement, unless you, mighty Dave, have some trusted internal info. :D

btw, here is the R520, but as with its 580 counterpart I had trouble placing the memory controller and ring-bus routing, as it seems a bit too complex and it's unclear what the wiring relation with the core sections is, hence its absence from the drawings. Any help?
 
fellix said:
But given the known thread-batch size and count on R580 compared to R520, I'm not very convinced by the "fully parallel" arrangement, unless you, mighty Dave, have some trusted internal info. :D
Well, apart from the above diagram, which I showed to Eric for his opinion beforehand, the other element pointing against inline is the shader compiler optimiser, one of the reasons ATI cite for sticking with the same basic "Per Pixel Shader ALU" structure. If they were inline then they would have greater dependencies on one another, changing the nature of the compiler optimiser; if they are just multiple pixels issued in parallel then it doesn't actually need to be changed.

Also, if they were inline then they would be operating on multiple instructions over the same pixel - ATI have already stated that this is not the case, as the thread sizes increase by 3x with R580/RV530, which is consistent with just issuing 3x the pixels in parallel.
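A quick back-of-the-envelope view of why that doesn't upset the compiler - the thread sizes below (16 pixels for R520, 48 for R580) are the commonly quoted figures, so treat them as illustrative:

```python
# If a thread of N pixels is issued P pixels per clock, one instruction
# occupies the ALUs for N / P clocks. Numbers are illustrative.
def clocks_per_instruction(thread_pixels, pixels_per_clock):
    return thread_pixels / pixels_per_clock

r520 = clocks_per_instruction(16, 4)    # 4 pixel ALUs per quad
r580 = clocks_per_instruction(48, 12)   # 12 pixel ALUs per quad, 3x wider
print(r520, r580)  # both 4.0 - same cadence per instruction, same dependency
                   # picture for the compiler, just 3x the pixels in flight
```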
 
OK, then I'll turn them 90° so the configuration matches the extended batch size (e.g., from a 4x3 to a 3x4 "matrix" placement). ;)
 
andypski said:
If you think otherwise then I might be so bold as to suggest that you might be getting it "horribly wrong".

I do think otherwise, since you are obviously equating a single G70 shader (2 ALUs each) with 3 full R580 shaders (~1.5 ALUs each). Each R580 shader cannot be considered a "single ALU" as you have done in your comparison above. Even if you consider comparing per-shader performance useless, comparing per-ALU performance is even more irrelevant and useless, IMO, especially using your definition of an ALU.
 
Mintmaster said:
Okay, I didn't write that very well. There are two points I wanted to make. First, having an additional MADD per clock doesn't get you very much. Second, the G70 pipeline isn't much faster than an R520 pipeline most of the time.

Indeed, I don't know how G70 would perform without the second ALU, so I was wrong in the way I wrote that statement.

Without the second ALU, the first ALU would be unavailable when texturing. For a 1:1 ALU:Tex scenario, that would halve performance. So the second ALU is definitely needed, although the MADD may not be critical.
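To put a toy number on the 1:1 case (assuming the single ALU is completely blocked while it services texturing, which is the worst case):

```python
# Toy cycle count for a shader with equal numbers of math and texture ops.
# With no second ALU, math and texturing serialise; with a second ALU the
# math can overlap the texture ops.
def issue_slots(math_ops, tex_ops, has_second_alu):
    return max(math_ops, tex_ops) if has_second_alu else math_ops + tex_ops

print(issue_slots(8, 8, has_second_alu=False))  # 16 slots
print(issue_slots(8, 8, has_second_alu=True))   # 8 slots -> ~2x throughput
```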
 
Jawed said:
I'm trying to view per-fragment arithmetic rate, with the ALU structure treated as a black box.

Since NV40 and R420 appeared, we've known that "per fragment, per clock" the significantly more complex ALU architecture of the NVidia "superscalar" design gives it an advantage, particularly with relatively short shaders or with _PP.

I am with you with respect to measuring arithmetic rate but I think it should be measured as per second rather than per clock. If you measure by per clock, NV's design will always come out on top since it does more per clock by design. But this design also means they are clocked lower. ATI does less per clock but is clocked higher. It's all very similar to the ILP vs. clock speed debate with the Pentium 4 and Athlon. So I'd measure math/second as opposed to math per clock.
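For instance, a crude per-second comparison might look like the sketch below - the clocks and the "one MAD vs two MADs per pipe" accounting are exactly the sort of assumptions being argued over in this thread, so take the numbers as illustrative only:

```python
# Crude illustration of math-per-second vs math-per-clock. Pipe counts,
# per-pipe MAD counts and clocks are assumed launch figures, not verified.
def mads_per_second(pipes, mads_per_pipe, clock_hz):
    return pipes * mads_per_pipe * clock_hz

g70  = mads_per_second(24, 2, 430e6)   # 24 pipes, 2 MAD-capable ALUs each
r520 = mads_per_second(16, 1, 625e6)   # 16 pipes, 1 'full' ALU each
print(g70 / 1e9, r520 / 1e9)  # the per-clock gap narrows once clocks are in
```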


If R580 is a 16-pipeline GPU (four "quads" of 12),
Where do the sixteen pipelines come from? Only 16 x 3 = 48, and I can't see a three in the R580 anywhere. I do agree it's four "quads" of 12.

..then that makes NV40 a 4 pipeline GPU (one "quad" of 16), as all 16 fragments being shaded in NV40 have identical shader state (even if they're on different triangles).
I don't think I follow. "One quad of 16"? If I understand 'quad' correctly, a 2x2 pixel region of a triangle rendered by four coupled pixel pipes, then the NV40 is surely four 'quads' of four. That it's a SIMD architecture would mean all quads are undergoing the same shader program.
 
trinibwoy said:
I do think otherwise, since you are obviously equating a single G70 shader (2 ALUs each) with 3 full R580 shaders (~1.5 ALUs each). Each R580 shader cannot be considered a "single ALU" as you have done in your comparison above. Even if you consider comparing per-shader performance useless, comparing per-ALU performance is even more irrelevant and useless, IMO, especially using your definition of an ALU.
You seem to want to count each of our ALUs as more than one for some reason, while being quite happy to simply treat each of nVidia's ALUs as a pure MAD; I don't see why this is valid at all.

Here is a link to a page with a slide, apparently from an nVidia presentation, detailing the capabilities of their ALUs -

http://www.tomshardware.com/2005/06/22/24_pipelines_of_power/page3.html

So each ALU in G70 apparently has a MAD, one of them also gets to do a normalization in parallel and both of them also have a 'mini' ALU, which can apparently perform some range of tasks (the full details of which I guess are undisclosed, but I expect at least modifiers like 2x, 4x and probably other things).
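For anyone not following the terminology, a MAD plus that kind of "free" output modifier folds into a single issued instruction, along these lines (a sketch of the idea, not the actual datapath):

```python
# Sketch: one MAD-capable ALU computing d = a * b + c, with an optional
# output modifier (e.g. _x2/_x4) applied in the 'mini' ALU stage for free.
# The available modifiers are an assumption based on the slide's description.
def mad(a, b, c, modifier=1.0):
    return (a * b + c) * modifier

print(mad(0.5, 0.5, 0.25, modifier=2.0))  # still just one issued instruction
```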

Why should they get a pass on all these additional capabilities, while you feel that ours have to be accounted for by some scaling? We don't have a parallel normalizing unit - nVidia chose to spend area there, while I guess we spent it elsewhere - why do you choose to ignore it?

The ALUs of each company are different, sharing some characteristics and differing in others, reflecting the design decisions we each made. Perhaps each of ours does more per-clock on average than the competition, but why does that mean we should suddenly apply a scaling factor of 1.5 to each of our ALUs? Just to make the performance seem less impressive?

Maybe you would like to apply a scaling parameter to the performance of Intel CPUs when compared to AMD ones, since apparently they don't do the same amount of work per clock either?
 
Jawed said:
You guys with your GPU simulators are the lucky ones :!: I'd love to know how a 2:1 R580 (instead of 3:1) would have performed in games - I suspect it would have been practically identical to a 3:1 R580.

I dare say we'll be waiting a long time before any games really stretch the 3:1 ratio.

Jawed

The latest tests with the usual suspects (UT2004, Doom3, Quake4) show little (~1%) benefit when moving from 2:1 to 3:1. The improvement for 2:1 tops out at ~10% (at 1024x768 8xAF) but is quite dependent on other parameters (number of available registers). And as usual the disclaimer: the simulator doesn't accurately represent any known or unknown real GPU, and the benchmarked games may not be representative of current Direct3D games (blame game developers for not releasing more OpenGL games ;) ).
 
JF_Aidan_Pryde said:
I am with you with respect to measuring arithmetic rate but I think it should be measured as per second rather than per clock. If you measure by per clock, NV's design will always come out on top since it does more per clock by design. But this design also means they are clocked lower. ATI does less per clock but is clocked higher. It's all very similar to the ILP vs. clock speed debate with the Pentium 4 and Athlon. So I'd measure math/second as opposed to math per clock.
Except that with the competing architectures clocking to within 10% of each other, I don't think that argument holds much sway.

Where do the sixteen pipelines come from? Only 16 x 3 = 48, and I can't see a three in the R580 anywhere. I do agree it's four "quads" of 12.
You don't see a 3 in R580 :oops: :?: I'm confused by what you're saying and I'm wondering if you've read Andy's posts.

I don't think I follow. "One quad of 16"? If I understand 'quad' correctly, a 2x2 pixel region of a triangle rendered by four coupled pixel pipes, then the NV40 is surely four 'quads' of four. That it's a SIMD architecture would mean all quads are undergoing the same shader program.
I put "quad" in quotes deliberately to point up the strange organisation I was describing.

Yes, NV40 has 4 quads, each quad consists of four ALU and TMU pipes. NV40 only has a single shader state, though, with one instruction in one shader being executed across all 16 fragments.

Do you define the fragment pipeline count by how many texture operations a GPU can do in parallel, or by the number of shader states it can support concurrently, or by the number of fragments that are in context?

---

As a matter of interest, I think there's a theory that NV40 and G70 actually have two fragments in context at any given time, with fragment A in shader unit 2 and fragment B in shader unit 1. On the next clock, fragment B is in shader unit 2 and fragment C is in shader unit 1. Can't remember where I came across this, though...
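If I've remembered the theory right, the arrangement would look roughly like this (purely a sketch of the idea, not a claim about the real pipeline):

```python
# Sketch of the 'two fragments in context' idea: each clock a fresh fragment
# enters shader unit 1 while the previous one advances to shader unit 2.
unit1 = unit2 = None
for clk, frag in enumerate("ABCD"):
    unit2 = unit1   # last clock's fragment moves on to shader unit 2
    unit1 = frag    # a new fragment enters shader unit 1
    print(f"clock {clk}: unit1={unit1} unit2={unit2}")
```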

Jawed
 
Jawed said:
As a matter of interest, I think there's a theory that NV40 and G70 actually have two fragments in context at any given time, with fragment A in shader unit 2 and fragment B in shader unit 1. On the next clock, fragment B is in shader unit 2 and fragment C is in shader unit 1. Can't remember where I came across this, though...

Jawed
I think it's pipelined much more deeply than that.
 
Chalnoth said:
I think it's pipelined much more deeply than that.
Indeed, but most of that deep pipeline is just for texture latency hiding. The ALUs probably have only a handful of stages each.
 
Xmas said:
Indeed, but most of that deep pipeline is just for texture latency hiding. The ALUs probably have only a handful of stages each.
Well, except it seems like the first ALU is shared with the texture unit, and thus would seem to require the same amount of latency. About the second one you're probably right, but we're still talking much more than one fragment at a time in the second ALU (I'd guess 4 at a minimum, quite possibly more).
 
Chalnoth said:
Well, except it seems like the first ALU is shared with the texture unit, and thus would seem to require the same amount of latency. About the second one you're probably right, but we're still talking much more than one fragment at a time in the second ALU (I'd guess 4 at a minimum, quite possibly more).
The first ALU is not "shared", it sits before the TMU. And obviously it's several quads, one in every pipeline stage.
 
Yes - the first ALU has the task of channelling the texture coordinates to the attached TMU, so either way it is affected by the latency, but that doesn't mean the MUL operators are hogged all the time, so with some smart reordering it is possible to utilise the ALU for math ops in the texture-fetch interims.
 
Xmas said:
The first ALU is not "shared", it sits before the TMU. And obviously it's several quads, one in every pipeline stage.
Ah, yeah, that's gotta be true. Nevermind on that point.
 
Since thread sizing is the primary mechanism for hiding the latency of texturing, I think it's quite likely that the total length of 500-700MHz GPU pipelines is under 10 cycles, and could easily be in the region of 5 or 6, including instruction fetch/decode/issue and register fetch. A chunk of that will prolly relate to register fetch as the huge register file in GPUs increases both indirection (banking) and distance on the die.

Additionally, the driver compiler should mean that instruction decode/issue proceeds very swiftly, as the issue complexity can be analysed at compile time.

Finally, due to threading, instruction fetch/decode/issue is only required irregularly (e.g. once every 256 fragments), so there's little reason to count that in the total execution length of the shader pipeline.
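As a rough illustration of how cheaply that amortises (the 256-fragment batch is the example above; the 4 fragments per clock per quad and the ~1-clock fetch cost are assumptions for the sake of the sum):

```python
# Amortising one instruction fetch/decode/issue over a batch of fragments.
# 256 fragments per batch comes from the example above; 4 fragments/clock
# and a ~1-clock fetch cost are assumptions for illustration.
batch_fragments = 256
fragments_per_clock = 4
execute_clocks = batch_fragments / fragments_per_clock   # 64 clocks of work
fetch_clocks = 1
print(fetch_clocks / (execute_clocks + fetch_clocks))    # ~1.5% overhead
```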

So in terms of active pipelining, we're left with register fetch and computation in shader units 1 and 2, with feed-forward of computed results from shader unit 1 to TMU or shader unit 2.

There's an awful lot written about the NV40 pipeline in:

http://www.3dcenter.org/artikel/nv40_pipeline/index_e.php

Jawed
 
Jawed said:
Except that with the competing architectures clocking to within 10% of each other, I don't think that argument holds much sway.
During the time when G70 and R520 were competing for top spot, the clock disparity was 45%. Even now, between the G70 512MB and R580, it's 18% in ATI's favour.
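(The arithmetic behind those percentages, using the commonly quoted clocks - the MHz figures are from memory, so treat them as assumptions:)

```python
# Checking the quoted clock disparities; clocks are assumed launch figures.
r520, g70     = 625, 430   # X1800 XT vs 7800 GTX, MHz
r580, g70_512 = 650, 550   # X1900 XTX vs 7800 GTX 512, MHz
print(f"{(r520 / g70 - 1) * 100:.0f}%")      # ~45%
print(f"{(r580 / g70_512 - 1) * 100:.0f}%")  # ~18%
```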

You don't see a 3 in R580 :oops: :?: I'm confused by what you're saying and I'm wondering if you've read Andy's posts.
I really don't see how dividing up texturing units helps define a pipeline. How the heck would one classify the Parhelia then?

Yes, NV40 has 4 quads, each quad consists of four ALU and TMU pipes. NV40 only has a single shader state, though, with one instruction in one shader being executed across all 16 fragments.
That's what I mean. But you've described it both as '1 quad of 16' and four quads of four! I'm probably misreading something.

A question though: if the NV40 has all pipelines executing the same instruction, what's the point of having quad groups of pipelines? I thought the very point of having quad groups was that within the group they're all doing the same instruction. If all sixteen pipes are doing the same instruction, doesn't that mean you need each triangle to be at least 16 fragments big in order to fully utilise the pipeline? Is this the same case with the X800 and G70?

Do you define the fragment pipeline count by how many texture operations a GPU can do in parallel, or by the number of shader states it can support concurrently, or by the number of fragments that are in context?
I thought to count the number of fragment pipelines, you count the number of fragments that can be output per clock; 16 for the NV40 and R520, 24 for the G70 and 48 for the R580.
 