What can be defined as an ALU exactly?

Jawed said:
Since thread sizing is the primary mechanism for hiding the latency of texturing, I think it's quite likely that the total length of 500-700MHz GPU pipelines is under 10 cycles, and could easily be in the region of 5 or 6, including instruction fetch/decode/issue and register fetch.
Nope. At 500 Mhz with standard-cell logic and FP32 precision, even a single DOT3 alone is going to take 6 to 8 pipeline stages, not counting the steps associated with getting instructions/data in and out of the DOT3 execution unit. Texturing is about an order of magnitude worse, somewhat depending on the mechanism used to provide latency tolerance.
 
JF_Aidan_Pryde said:
During the time when G70 and R520 were competing for top spot, the clock disparity was 45%. Even now, between the G70 512MB and R580, it's 18% in ATI's favour.
  • Look at midrange cards, e.g. RV410 versus NV43
  • At 90nm R580 and G71 are going to be within 10% of each other
I really don't see how dividing up texturing units helps define a pipeline. How the heck would one classify the Parhelia then?
You've got me, I have no idea what Parhelia is - literally. Have you read Andy's posts?

That's what I mean. But you've described it both as '1 quad of 16' and four quads of four! I'm probably misreading something.
A problem arises because Andy described R580 as a 16-pipeline part with 12 fragments being shaded per "quad". That's effectively "per TMU quad".

A question though: if the NV40 has all pipeline executing the same instruction, what's the point of having quad groups of pipelines?
Because texturing is intrinsically a quad-based operation (as near as dammit) it makes sense for the TMUs to be arranged in quads, and since NV40 has tightly coupled ALUs and TMUs, everything goes together as four sets of quads.

I thought the very point of having quad groups was that within the group it's all doing the same instruction. If all sixteen pipes are doing the same instruction, doesn't that mean you need each triangle to be at least 16 fragments big in order to be fully utilising the pipeline? Is this the same case with the X800 and G70?
In NV40/G70 it's possible to put multiple triangles into a thread, and therefore there's very low risk of any single pipeline being "unused".

R3xx...R4xx...R5xx appear to work strictly on one triangle per thread so small triangles do hurt these architectures somewhat. I'm guessing this is why a thread in these architectures (256 fragments) is smaller than in NV40 (4096) and G70 (1024).
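As a rough sanity check on those figures, here's a sketch (not a description of the real hardware) that converts the quoted thread sizes into clocks-per-instruction, assuming each thread is walked by a single quad pipe at 4 fragments per clock:

```python
# Sketch: clocks one instruction occupies a quad pipe, given the thread
# sizes quoted above. The 4-fragments-per-clock-per-quad figure is an
# assumption for illustration, not a confirmed hardware detail.
FRAGMENTS_PER_CLOCK_PER_QUAD = 4

thread_sizes = {"NV40": 4096, "G70": 1024, "R3xx-R5xx": 256}

for gpu, fragments in thread_sizes.items():
    clocks = fragments // FRAGMENTS_PER_CLOCK_PER_QUAD
    print(f"{gpu}: {fragments}-fragment thread -> ~{clocks} clocks per instruction")

# G70: 1024 fragments -> ~256 clocks per instruction (the figure that comes
# up again later in the thread); NV40 -> ~1024; R3xx-R5xx -> ~64.
```

The longer the per-instruction duration, the more texture latency a thread can absorb - and, conversely, the more fragments you need in flight to keep the quad busy.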

I thought to count the number of fragment pipelines, you count the number of fragments that can be outputted per clock; 16 for the NV40 and R520, 24 for the G70 and 48 for the R580.
Well I tend to agree :smile:

But in pure engineering terms, the ganging of fragment processing into quads and/or arrays (e.g. Xenos), as I've been describing, means there are far fewer actual pipelines than marketing would have you believe.

Jawed
 
Jawed said:
You've got me, I have no idea what Parhelia is - literally. Have you read Andy's posts?
The Parhelia was Matrox' last high-end product. Released not long before the 9700 Pro, this was the last of the high-end DX8 products. Its primary claim to fame was the support for three displays (triple head gaming). With some games, you could make use of the three-display output for a panoramic view. It also supported "Fragment Anti-Aliasing" which was a method of selective supersampling where the card would only supersample those parts of the frame that were on the edges of objects (not all triangle edges: it attempted to separate out those which would cause aliasing, such as the silhouette of a mesh). The supersampling was 16x ordered-grid, and its performance hit was roughly on par with that of the GeForce4 Ti's 4x multisampling.

Its performance was subpar (for the price and time), though, and thus it was overshadowed once the DX9 cards started to show up.

If I remember correctly, it supported a large number of texture ops per pipeline, something like 3-4.
 
arjan de lumens said:
Nope. At 500 Mhz with standard-cell logic and FP32 precision, even a single DOT3 alone is going to take 6 to 8 pipeline stages, not counting the steps associated with getting instructions/data in and out of the DOT3 execution unit.
A DP4 in Xenon takes 14 stages at 3.2GHz. At 600MHz I'd expect it to take way less than half the number of stages.

And even if "long" instructions such as DP or RSQ take multiple cycles, a simpler ADD or MUL should run in 1 cycle.
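A naive back-of-the-envelope version of that scaling argument (only the 14-stage/3.2GHz figure comes from this thread; the rest is assumption, and it ignores per-stage register overhead and the standard-cell versus custom-design point raised in the replies below, which is why real stage counts come out higher):

```python
from math import ceil

# Naive model: an op needs a fixed amount of logic delay, so the stage
# count scales with the clock period.
def stages_needed(logic_delay_ns: float, clock_mhz: float) -> int:
    cycle_ns = 1000.0 / clock_mhz
    return max(1, ceil(logic_delay_ns / cycle_ns))

dp4_logic_ns = 14 * (1000.0 / 3200.0)          # ~4.4 ns of logic depth

print(stages_needed(dp4_logic_ns, 3200))       # 14 (by construction)
print(stages_needed(dp4_logic_ns, 600))        # ~3 stages at 600 MHz
print(stages_needed(dp4_logic_ns, 500))        # ~3 stages at 500 MHz
```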

Texturing is about an order of magnitude worse, somewhat depending on the mechanism used to provide latency tolerance.
But that's a separate pipeline with its own startup cost (i.e. fetch from memory or L2 as required) - and with texture data in cache a bilinear operation is supposed to be 1 cycle.

I was focussing on the NV40/G70 ALU pipeline and pointing out that fragment instructions are possibly staggered across the two shader units, rather than dual-issue occurring on a single fragment on both shader units.

---

As a matter of interest, Xenon at 3.2GHz has a ~610 cycle L2 miss penalty with 700MHz GDDR3 as the memory (via Xenos's northbridge, i.e. extra delay). At 600MHz, that's about 115 clock cycles, which is about half the thread duration (256 clocks) for one instruction in G70.
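The conversion is just a clock-ratio scaling; a quick worked version using the figures above:

```python
# Scale Xenon's ~610-cycle L2-miss penalty (in 3.2 GHz cycles) to 600 MHz
# GPU clocks and compare it with G70's ~256-clock per-instruction thread.
cpu_mhz, gpu_mhz = 3200, 600
miss_cpu_cycles = 610

miss_ns = miss_cpu_cycles * 1000.0 / cpu_mhz        # ~190 ns
miss_gpu_cycles = miss_ns * gpu_mhz / 1000.0        # ~114 GPU clocks

thread_clocks = 256                                  # G70, per instruction
print(round(miss_gpu_cycles), miss_gpu_cycles / thread_clocks)  # ~114, ~0.45
```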

Jawed
 
In traditional terms Parhelia was a 4x4 architecture, whereas NV30 was 4x2 and R300 was 8x1. Its performance was so low mainly because it lacked any Z-occlusion culling capability.
 
JF_Aidan_Pryde said:
I thought to count the number of fragment pipelines, you count the number of fragments that can be outputted per clock; 16 for the NV40 and R520, 24 for the G70 and 48 for the R580.
Outputted to where? I thought all of the above can output up to 16 fragments into the framebuffer. :)
Anyway, I think I understand what you mean, but that number seems pretty much meaningless, as it describes neither the architecture nor the performance. All this nonsense of "ALU A is 1.5x ALU B" seems as pointless as comparing the MIPS or IPC numbers of modern processors. There plainly and simply isn't an IPC figure without qualifications. Either you compare performance (per chip or per wall clock) running one shader or another, or you try to specify the architecture in full detail. There just isn't any other relevant way to compare such differing architectures.
 
Jawed said:
But that's a separate pipeline with its own startup cost (i.e. fetch from memory or L2 as required) - and with texture data in cache a bilinear operation is supposed to be 1 cycle.
The "texturing pipeline" isn't separate in NV3x/NV4x/G7x. And bilinear filtering/sampling has 1 quad/cycle throughput, but it takes several cycles.

I was focussing on the NV40/G70 ALU pipeline and pointing out that fragment instructions are possibly staggered across the two shader units, rather than dual-issue occurring on a single fragment on both shader units.
It's a pipeline, so different stages are working on different quads.
 
Xmas said:
The "texturing pipeline" isn't separate in NV3x/NV4x/G7x.

From:

http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=9


[image: ps.gif]


It's just the texture address calculation, in dependent texturing, that occupies shader unit 1, as far as I can tell.

Half-way down this page:

http://www.3dcenter.org/artikel/nv40_technik/index2_e.php

the TMU is shown as separated from the ALU pipeline.

Jawed
 
A bit of an aside question about the tex caches in NV40 - is it true that the L1 tex cache is shared per single quad, like the L2 is shared by all of the quads?
 
andypski said:
You seem to want to count our ALUs as more than one for some reason, while you seem to be quite happy to simply treat each of nVidia's ALUs as a pure MAD, however I don't see why this is valid at all.

Oh, didn't know you worked for ATi :smile: And no, that's not what I'm doing at all. My 2 for G70 and 1.5 for R580 is down to the commonly referenced simplified MADD+MADD and MADD+ADD capabilities of each respective shader, notwithstanding architectural differences.

andypski said:
So each ALU in G70 apparently has a MAD, one of them also gets to do a normalization in parallel and both of them also have a 'mini' ALU, which can apparently perform some range of tasks (the full details of which I guess are undisclosed, but I expect at least modifiers like 2x, 4x and probably other things).

Why should they get a pass on all these additional capabilities, while you feel that ours have to be accounted for by some scaling? We don't have a parallel normalizing unit - nVidia chose to spend area there, while I guess we spent it elsewhere - why do you choose to ignore it?

The ALUs of each company are different, sharing some characteristics and differing in others, reflecting the design decisions we each made. Perhaps each of ours does more per-clock on average than the competition, but why does that mean we should suddenly apply a scaling factor of 1.5 to each of our ALUs? Just to make the performance seem less impressive?

What you've outlined above is exactly why per-shader performance is a much better metric than per-ALU performance since the numbers aren't obfuscated by the intricate architectural details.

What you've still not addressed is your justification for considering R580 a 16-shader part, while at the same time considering G70 a 24-shader part. The only fair comparisons I can see are 6/12 (quads), 24/48 (shaders) or 48/48 (ALUs) (or 48/96 if we count the ADD) for G70/R580.

And if you want to get into the discounting game, considering R580 a 48 ALU part still discounts the ADD of the first ALU. I think that's a very generous trade for Nvidia's mini-ALU.
 
Jawed said:
A DP4 in Xenon takes 14 stages at 3.2GHz. At 600MHz I'd expect it to take way less than half the number of stages.
1. CPUs are not usually designed with standard cell libraries.
2. CPUs are optimized almost exclusively for (effective) latency of operations, and clock speed.

A lot more transistors are being used to implement that DP4 at 3.2 GHz than a DP4 on a GPU. A high-end GPU would also have significantly more capability to compute DP4s than Xenon.


fellix said:
A bit of an aside question about the tex caches in NV40 - is it true that the L1 tex cache is shared per single quad, like the L2 is shared by all of the quads?

According to this diagram, yes
 
Jawed said:
  • Look at midrange cards, e.g. RV410 versus NV43
  • At 90nm R580 and G71 are going to be within 10% of each other
Okay, but surely you can agree that by measuring math/second, we're measuring throughput, which is every bit as important as math/clock. And when there's a large clock difference, it's especially important.

You've got me, I have no idea what Parhelia is - literally. Have you read Andy's posts?
A four pipeline card with four TMUs per pipe. It was cool while it lasted. :)

I re-read Andy's post. He's basically dividing up the pipes by texturing units because of the odd numbers present in the R580. But I don't see how based on that you can say that the NV40 is a one pipeline card. It has the same number of shaders as texture units. So it's just a 16-pipe card.

Because texturing is intrinsically a quad-based operation (as near as dammit) it makes sense for the TMUs to be arranged in quads, and since NV40 has tightly coupled ALUs and TMUs, everything goes together as four sets of quads.
Can you provide more details? Don't texture units already fetch four samples at full speed? How does ganging four texture units together help?

In NV40/G70 it's possible to put multiple triangles into a thread, and therefore there's very low risk of any single pipeline being "unused".
Do the triangles in the thread have to be physically adjacent to each other?

Well I tend to agree :smile:

But in pure engineering terms, the ganging of fragment processing into quads and/or arrays (e.g. Xenos), as I've been describing, means there are far fewer actual pipelines than marketing would have you believe.
Only if you define the GPU in terms of the number of shader states. And in terms of ganging pipes, the fragment rate is only reduced if the ganging is serial; so long as there are 'n' shader units all outputting fragments in parallel, I think it's totally valid to describe the G70 as a 24 fragment-pipe part and the R580 as a 48 fragment-pipe part. Of course the contents of the pipeline deserve separate discussion. :)
 
Dave Baumann said:
Their diagrams also represent it closer to the patent, dependent on how they are visualising it:

http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=10
[posting this despite reservations of pissing into the wind]

As far as I can tell that's simply indicating that the first shader unit is either doing shader arithmetic or it's calculating dependent texture addressing.

The fundamental issue here is that if the TMU were truly in-line and the resulting pipeline were dozens if not hundreds of clocks long, then there'd be no need for threads (or they could consist of a few tens of fragments, not the hundreds they actually do).

As I showed earlier, typical GDDR3 fetch latency is easily hidden solely by per-quad-pipe thread size (~115 cycles of latency with 700MHz GDDR3 on a 600MHz GPU is easily hidden by a 256-cycle-per-instruction thread).

Additionally we can clearly see in R3xx etc. that the semi-decoupled texturing of that architecture requires a specific texture address calculation ALU (which people continue to forget to count when "counting pipeline ALUs") which then feeds a texturing pipeline. So texturing proceeds asynchronously.

[image: shadercore.jpg]


Xenos, of course, is the model of fully decoupled texturing, but R5xx also achieves the same in the context of pixel shading:

[image: over.jpg]


where the thread in the shader core will often not be the same as the thread in the corresponding texture unit.

All that's happening in NV40/G70 is that there is no dedicated ALU for dependent-texturing address calculations (so shader unit 1 is overloaded). Texturing itself proceeds asynchronously, with typical thread sizes enabling the texture pipe to produce its results before the fragment returns to context and needs that result, 1, 2 or more instructions later. With bilinear filtering that will normally be the following instruction, a minimum of 256 cycles after the texture operation is commenced.

Jawed

EDIT: removed the bit about 32-fold versus 8-fold - sigh, brain attack...
 
Bob said:
1. CPUs are not usually designed with standard cell libraries.
2. CPUs are optimized almost exclusively for (effective) latency of operations, and clock speed.
I think that 14-stage DP4 pipeline might actually be so long because it's an SMT architecture - so it could arguably be half that length :smile:

A MADD in Xenon is 12 stages, as compared with 6 in Cell SPE (also at 3.2GHz). Maybe I should have used Cell SPE's vector pipeline for the comparison :oops:

Jawed
 
JF_Aidan_Pryde said:
Okay, but surely you can agree that by measuring math/second, we're measuring throughput, which is every bit as important as math/clock. And when there's a large clock difference, it's especially important.
That might be what you are interested in, but it's not what I was interested in.

I re-read Andy's post. He's basically dividing up the pipes by texturing units because of the odd numbers present in the R580. But I don't see how based on that you can say that the NV40 is a one pipeline card. It has the same number of shaders as texture units. So it's just a 16-pipe card.
Ah well, never mind, it's not an interesting perspective if you're not interested by the architecture, per se.

Can you provide more details? Don't texture units already fetch four samples at full speed? How does ganging four texture units together help?
Prolly a semantic thing, I'm simply saying that four texels together is the natural order of things:

http://www.3dcenter.org/artikel/nv40_technik/index3_e.php

Do the triangles in the thread have to be physically adjacent to each other?
Dunno! Maybe Bob will say. I doubt they do, since it's really about shader state.

Jawed
 
trinibwoy said:
What you've outlined above is exactly why per-shader performance is a much better metric than per-ALU performance since the numbers aren't obfuscated by the intricate architectural details.
I believe I was originally looking more at "per-shader" performance earlier rather than per ALU, but people seemed to want to look at per-ALU for some reason.
What you've still not addressed is your justification for considering R580 a 16-shader part, while at the same time considering G70 a 24-shader part.
Funny - I thought I had addressed that, but I'll try again.

We want to compare the respective architectures, and for the purposes of comparison we want to divide them up into chunks called 'shaders', each of which can execute a pixel shader program.

What do you need to have in order to run a basic pixel shader program?

- To be able to run a generic pixel shader program you require ALU and texture resources (and potentially flow control etc., but let's keep things relatively simple)

So a 'unit' to run a pixel shader program needs ALU and texture resources, so I divide the respective architectures up evenly into chunks that meet the criteria and call them 'shaders'. This results in the following divisions -

R580 - 16 "shaders", each with 1 texture resource and 3 ALU resources
G70 - 24 "shaders", each with 1 texture/ALU resource and 1 dedicated ALU resource.

Dividing things in this way doesn't necessarily have anything to do with the underlying architecture - it's just a way to form a basis for comparison. I guess you could actually pick _any_ basis, as long as your assumptions are consistent and correct, and perform a comparison.

For example, if you want to look at the ALU-only case we could choose to discount the texture resources entirely and say that G70 has 48 ALUs and R580 has 48 ALUs, which is what I did in the earlier post about the Cook-Torrance shader performance.
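A quick tally of that ALU-only view, using the divisions just described (and the simplified one-MAD-per-ALU assumption under debate here, ignoring the mini-ALUs, the NRM unit and the ADD-only capability):

```python
# Tally per-clock ALU capability under the 'shader' division above.
# Assumes one full MAD per ALU resource - the simplification being debated.
chips = {
    "R580": {"shaders": 16, "alus_per_shader": 3},  # 1 tex + 3 ALU per shader
    "G70":  {"shaders": 24, "alus_per_shader": 2},  # 1 tex/ALU + 1 dedicated ALU
}

for name, c in chips.items():
    total_alus = c["shaders"] * c["alus_per_shader"]
    print(f"{name}: {c['shaders']} shaders x {c['alus_per_shader']} ALUs = {total_alus} ALUs")

# Both tallies come out at 48, whichever grouping you start from; the
# real disagreement is over what each of those 'ALUs' can actually do.
```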

The problem then obviously comes back to the more thorny area of the debate which is "what is considered an ALU?". A lot of people seem to feel that it's an individual block that contains a MAD unit, which was how I framed it, but as you correctly point out this is not the only possibility by any means.

And if you want to get into the discounting game, considering R580 a 48 ALU part still discounts the ADD of the first ALU

I wasn't playing a discounting game - I was playing a simplifying game. :)

When I initially started comparing things I was pretty content to work at a more abstract level - take each ALU chunk from each architecture as a black-box and simply compare the apparent execution characteristics on the supplied shaders. I was not really looking for why A was faster or slower than B, or whether A is more expensive in silicon than B, which are also interesting questions to answer.

Mainly the interest seemed to be in comparing the performance characteristics of the ALUs of the two designs, which is what I did. Examining the exact tradeoffs that went into the different designs and their detailed behaviour would be far more complex. Both architectures have "ALUs" that are more complex and perform more operations than a simple MAD.

I think that's a very generous trade for Nvidia's mini-ALU.
Whether it's a generous trade or not would depend on what the full capabilities of that mini-ALU are (and things like the NRM unit, of course). We could equally say that allowing particular ALUs to run at 16-bit precision and comparing to 32-bit precision is also a generous trade in the opposite direction, and for like-like comparison we should always stick to 32-bit precision only (which is fine by me, by the way...;)).

I agree that it's just very difficult to do this analysis in a 'fair' manner generally, and I certainly see your point about the additional capabilities on R5xx ALUs, but I just don't see how we can count us as having an extra ADD without counting an extra NRM, for instance. If you want to say that G70 has only 24 ALUs then I guess you can do that, but then you are just reversing the problem, because you are then saying it's fair to equate something like -

(MAD + miniALU_X + NRM + MAD + miniALU_Y)

as a functional unit to:

(miniALU_Z + MAD)

I guess the one thing that we can say for sure is that any way in which we choose to frame this, someone is going to feel hard done by.
 
I believe I was originally looking more at "per-shader" performance earlier rather than per ALU, but people seemed to want to look at per-ALU for some reason.

This seems like a totally pointless metric to me.

Without understanding co-issue restrictions, the number of register ports, how the ALUs actually map to hardware-specific shader instructions, register usage costs and so on, all you're really doing is measuring overall throughput in a specific test and dividing by some arbitrary number. And as evidenced by this thread, you can't even decide what you should be dividing by.

You simply cannot isolate the number you are trying to measure without architectural details and tests tailored to test that one element.
 