I don't think they are inline, in the sense that they are passing anything from one to another; AFAIK they are completely parallel. For that diagram I would turn them on their sides and get rid of the blue links between.
fellix said: But assuming the known thread-batch size and count on R580 compared to R520, I'm not very convinced in the "full parallel" aligning, unless you mighty Dave have some trusted internal info.
Well, apart from the above diagram, which I showed to Eric beforehand to ask his opinion on it, the other element pointing against inline is the shader compiler optimiser, one of the reasons ATI cite for sticking with the same basic "Per Pixel Shader ALU" structure. If they were inline then they would have greater dependencies on one another, changing the nature of the compiler optimiser; if they are just multiple pixels issued in parallel then this doesn't actually need to be changed.
andypski said: If you think otherwise then I might be so bold as to suggest that you might be getting it "horribly wrong".
Mintmaster said: Okay, I didn't write that very well. There are two points I wanted to make. First, having an additional MADD per clock doesn't get you very much. Second, the G70 pipeline isn't much faster than a R520 pipeline most of the time.
Indeed, I don't know how G70 would perform without the second ALU, so I was wrong in the way I wrote that statement.
Jawed said: I'm trying to view per-fragment arithmetic rate, with the ALU structure treated as a black box.
Since NV40 and R420 appeared, we've known that "per fragment, per clock" the significantly more complex ALU architecture of the NVidia "superscalar" design gives it an advantage, particularly with relatively short shaders or with _PP.
If R580 is a 16-pipeline GPU (four "quads" of 12),
Where does the sixteen pipeline come from? Only 16 x 3 = 48 and I can't see a three in the R580 anywhere. I do agree it's four "quads" of 12.
then that makes NV40 a 4 pipeline GPU (one "quad" of 16), as all 16 fragments being shaded in NV40 have identical shader state (even if they're on different triangles).
I don't think I follow. "One quad of 16"? If I understand 'quad' correctly, a 2x2 pixel region of a triangle rendered by four coupled pixel pipes, then the NV40 is surely four 'quads' of four. That it's a SIMD architecture would mean all quads are undergoing the same shader program...
trinibwoy said: I do think otherwise, since you are obviously equating a single G70 shader (2 ALU's each) with 3 full R580 shaders (~ 1.5 ALU's each). Each R580 shader cannot be considered a "single ALU" as you have done in your comparison above. Even if you consider comparing per-shader performance as useless, comparing per-ALU performance is even more irrelevant and useless, IMO, especially using your definition of an ALU.
You seem to want to count our ALUs as more than one for some reason, while you seem to be quite happy to simply treat each of nVidia's ALUs as a pure MAD; however, I don't see why this is valid at all.
Jawed said: You guys with your GPU simulators are the lucky ones. I'd love to know how a 2:1 R580 (instead of 3:1) would have performed in games - I suspect it would have been practically identical to a 3:1 R580.
I dare say we'll be waiting a long time before any games really stretch the 3:1 ratio.
Jawed
JF_Aidan_Pryde said: I am with you with respect to measuring arithmetic rate but I think it should be measured as per second rather than per clock. If you measure by per clock, NV's design will always come out on top since it does more per clock by design. But this design also means they are clocked lower. ATI does less per clock but is clocked higher. It's all very similar to the ILP vs. clock speed debate with the Pentium 4 and Athlon. So I'd measure math/second as opposed to math per clock.
Except that with the competing architectures clocking to within 10% of each other, I don't think that argument holds much sway.
Where does the sixteen pipeline come from? Only 16 x 3 = 48 and I can't see a three in the R580 anywhere. I do agree it's four "quads" of 12.
You don't see a 3 in R580? I'm confused by what you're saying and I'm wondering if you've read Andy's posts.
I don't think I follow. "One quad of 16"? If I understand 'quad' correctly, a 2x2 pixel region of a triangle rendered by four coupled pixel pipes, then the NV40 is surely four 'quads' of four. That it's a SIMD architecture would mean all quads are undergoing the same shader program.
I put "quad" in quotes deliberately to point up the strange organisation I was describing.
Jawed said: As a matter of interest, I think there's a theory that NV40 and G70 actually have two fragments in context at any given time, with fragment A in shader unit 2 and fragment B in shader unit 1. On the next clock, fragment B is in shader unit 2 and fragment C is in shader unit 1. Can't remember where I came across this, though...
I think it's pipelined much more deeply than that.
Jawed
Chalnoth said: I think it's pipelined much more deeply than that.
Indeed, but most of that deep pipeline is just for texture latency hiding. The ALUs probably have only a handful of stages each.
Xmas said: Indeed, but most of that deep pipeline is just for texture latency hiding. The ALUs probably have only a handful of stages each.
Well, except it seems like the first ALU is shared with the texture unit, and thus would seem to require the same amount of latency. About the second one you're probably right, but we're still talking much more than one fragment at a time in the second ALU (I'd guess 4 at a minimum, quite possibly more).
Chalnoth said: Well, except it seems like the first ALU is shared with the texture unit, and thus would seem to require the same amount of latency. About the second one you're probably right, but we're still talking much more than one fragment at a time in the second ALU (I'd guess 4 at a minimum, quite possibly more).
The first ALU is not "shared", it sits before the TMU. And obviously it's several quads, one in every pipeline stage.
Xmas said: The first ALU is not "shared", it sits before the TMU. And obviously it's several quads, one in every pipeline stage.
Ah, yeah, that's gotta be true. Never mind on that point.
Jawed said: Except that with the competing architectures clocking to within 10% of each other, I don't think that argument holds much sway.
During the time when G70 and R520 were competing for top spot, the clock disparity was 45%. Even now, between the G70 512MB and R580, it's 18% in ATI's favour.
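As a rough check on the 45% and 18% figures, here is a minimal Python sketch. The core clocks are commonly quoted reference clocks for these boards, not numbers given in this thread, so treat them (along with the dictionary keys and the function name) as assumptions.

```python
# Core clocks in MHz. ASSUMPTION: commonly quoted reference clocks for the
# retail boards; none of these values appear in the thread itself.
CORE_CLOCK_MHZ = {
    "G70 (7800 GTX)": 430,
    "G70 512MB (7800 GTX 512)": 550,
    "R520 (X1800 XT)": 625,
    "R580 (X1900 XTX)": 650,
}


def clock_advantage_pct(faster: str, slower: str) -> float:
    """Return how far `faster` out-clocks `slower`, as a percentage."""
    ratio = CORE_CLOCK_MHZ[faster] / CORE_CLOCK_MHZ[slower]
    return (ratio - 1.0) * 100.0


if __name__ == "__main__":
    pairs = [
        ("R520 (X1800 XT)", "G70 (7800 GTX)"),
        ("R580 (X1900 XTX)", "G70 512MB (7800 GTX 512)"),
    ]
    for faster, slower in pairs:
        pct = clock_advantage_pct(faster, slower)
        print(f"{faster} vs {slower}: {pct:.0f}% higher core clock")
```

With those assumed clocks the two gaps round to 45% and 18%. The same ratio is also the per-clock advantage the slower-clocked part would need just to break even on a per-second basis.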
You don't see a 3 in R580? I'm confused with what you're saying and I'm wondering if you've read Andy's posts.
I really don't see how dividing up texturing units helps define a pipeline. How the heck would one classify the Parhelia then?
Yes, NV40 has 4 quads, each quad consists of four ALU and TMU pipes. NV40 only has a single shader state, though, with one instruction in one shader being executed across all 16 fragments.
That's what I mean. But you've described it both as '1 quad of 16' and four quads of four! I'm probably misreading something.
Do you define the fragment pipeline count by how many texture operations a GPU can do in parallel, or by the number of shader states it can support concurrently, or by the number of fragments that are in context?
I thought that to count the number of fragment pipelines, you count the number of fragments that can be output per clock: 16 for the NV40 and R520, 24 for the G70 and 48 for the R580.
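To make the competing "pipeline count" definitions concrete, here is a small Python sketch that counts each chip two ways: by pixel shader units and by texture units. The NV40 layout (four quads of four) and the R580 layout (four "quads" of 12) come from the posts above; the quad layouts for R520 and G70 and all of the texture-unit counts are commonly cited figures rather than anything stated in this thread, so treat them as assumptions. The class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    quads: int                  # groups working on 2x2 pixel blocks
    shader_units_per_quad: int  # pixel shader units in each quad
    tmus_per_quad: int          # texture units in each quad (ASSUMPTION)


# NV40 and R580 layouts follow the thread; R520 and G70 layouts are assumed.
GPUS = [
    Gpu("NV40", quads=4, shader_units_per_quad=4, tmus_per_quad=4),
    Gpu("R520", quads=4, shader_units_per_quad=4, tmus_per_quad=4),
    Gpu("G70", quads=6, shader_units_per_quad=4, tmus_per_quad=4),
    Gpu("R580", quads=4, shader_units_per_quad=12, tmus_per_quad=4),
]


def by_shader_units(gpu: Gpu) -> int:
    """Pipeline count if every pixel shader unit counts as a pipeline."""
    return gpu.quads * gpu.shader_units_per_quad


def by_texture_units(gpu: Gpu) -> int:
    """Pipeline count if only texture units define a pipeline."""
    return gpu.quads * gpu.tmus_per_quad


if __name__ == "__main__":
    for gpu in GPUS:
        shaders, tmus = by_shader_units(gpu), by_texture_units(gpu)
        print(f"{gpu.name}: {shaders} by shader units, {tmus} by texture units, "
              f"ratio {shaders // tmus}:1")
```

Under these assumptions the two definitions agree for NV40, R520 and G70 and only diverge for R580 (48 versus 16), which is where the "3" in 16 x 3 = 48 comes from.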