What can be defined as an ALU exactly?

ERP said:
All you're really doing is measuring overall throughput in a specific test and dividing by some arbitrary number. And as evidenced by this thread, you can't even decide what you should be dividing by.
Well, I can decide what I think we should be dividing by, but that's evidently not going to be the same as what other people think we should be dividing by... :)
 
JF_Aidan_Pryde said:
Can you provide more details? Don't texture units already fetch four samples at full speed? How does ganging four texture units together help?
They fetch four texels and filter them, resulting in a single filtered sample. But they are organized as quads because LOD is calculated once per quad based on the differences between the texture coordinates inside the quad. Another interesting property of quad-TMUs is that they usually only need 9 texels to generate 4 bilinear samples.
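The 9-texel figure is easy to verify with a toy count (my sketch, assuming roughly 1:1 texel-to-pixel mapping, so adjacent fragments' bilinear footprints start one texel apart):

Code:
# Toy count of unique texels for the 4 bilinear lookups of a 2x2 quad.
# Assumes ~1:1 magnification; purely illustrative, not how hardware does it.
def footprint(x, y):
    # integer texel coordinates of the 2x2 bilinear footprint at (x, y)
    return {(x + dx, y + dy) for dx in (0, 1) for dy in (0, 1)}

texels = set()
for fx, fy in [(0, 0), (1, 0), (0, 1), (1, 1)]:   # the four fragments
    texels |= footprint(fx, fy)
print(len(texels))   # -> 9: a 3x3 texel block covers all four samples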

Do the triangles in the thread have to be physically adjacent to each other?
No.

Jawed said:
It's just the texture address calculation, in dependent texturing, that occupies shader unit 1, as far as I can tell.

Half-way down this page:

http://www.3dcenter.org/artikel/nv40_technik/index2_e.php

the TMU is shown as separated from the ALU pipeline.

Jawed
Those diagrams are too high-level/abstract.

Jawed said:
The fundamental issue here is that if the TMU is truly in-line and the resulting pipeline was dozens if not hundreds of clocks long, then there'd be no need for threads (or they could consist of a few 10s of fragments, not hundreds as they actually do).
Not sure how you came to that conclusion. The pipeline is very deep, although most of it is just a FIFO for texture latency hiding.

Additionally, we can clearly see in R3xx etc. that the semi-decoupled texturing of that architecture requires a specific texture address calculation ALU (which people continue to forget to count when "counting pipeline ALUs"), which then feeds a texturing pipeline. So texturing proceeds asynchronously.
R3xx is absolutely not comparable to NV4x/G7x when it comes to shader pipelines.
R3xx has two "loops", a texture loop and an ALU loop, and it executes shaders as (up to 4) phases, each consisting of a tex part and an arithmetic part. NV4x only has one loop that contains both ALUs and TMU.
Code:
   |                 |
   +-----<----+      +-----<----+
   |          |      |          |
   +---<---+  |  MAD/Tex Addr   |
   |       |  |      |          |
Tex Addr   |  |     TMU         ^
   |       ^  |      |          |
  TMU      |  |     MAD         |
   |       |  |      |          |
   +--->---+  |      +----->----+
   |          |      |
   +---<---+  ^
   |       |  | x3
  ADD      |  |
   |       ^  |
  MAD      |  |
   |       |  |
   +--->---+  |
   |          |
   +----------+
   |
(This is not a precise representation of the pipelines, just simple diagrams showing the flow of a single quad.)
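To make the structural difference concrete, here's a toy decomposition of a hypothetical instruction stream under each scheme (my own simplification of the diagrams above, not vendor documentation):

Code:
# Toy model: how one shader might be split up per the diagrams above.
shader = ["tex", "mad", "mad", "tex", "mad"]   # hypothetical program

def r3xx_phases(prog):
    # R3xx: phases of (tex part, ALU part); a tex after ALU work starts
    # a new phase. (R3xx allows at most 4 phases; not enforced here.)
    phases, tex, alu = [], [], []
    for op in prog:
        if op == "tex" and alu:
            phases.append((tex, alu))
            tex, alu = [], []
        (tex if op == "tex" else alu).append(op)
    phases.append((tex, alu))
    return phases

def nv4x_passes(prog):
    # NV4x: one loop; each trip issues at most one tex (shader unit 1
    # doubles as the tex-address ALU), ALU ops ride along in the pass.
    passes, current, used_tex = [], [], False
    for op in prog:
        if op == "tex" and used_tex:
            passes.append(current)
            current, used_tex = [], False
        current.append(op)
        used_tex |= (op == "tex")
    passes.append(current)
    return passes

print("R3xx phases:", r3xx_phases(shader))
print("NV4x passes:", nv4x_passes(shader))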

All that's happening in NV40/G70 is that there is no dedicated ALU for dependent-texturing address calculations (so shader unit 1 is overloaded). Texturing itself proceeds asynchronously, with typical thread sizes enabling the texture pipe to produce its results before the fragment returns to context and needs that result, 1, 2 or more instructions later. With bilinear filtering, that will normally be the following instruction, a minimum of 256 cycles after the texture operation is commenced.
The following instruction will usually execute in the second ALU, which a quad reaches in less than 256 cycles. I initially thought that texturing would proceed asynchronously, too, but I'm not convinced of that any more.
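Either way, a back-of-envelope count (my arithmetic, using the guessed 256-cycle loop from the previous post) shows why thread sizes end up in the hundreds of fragments:

Code:
# Fragments in flight needed to keep the loop full. Assumes one quad
# enters per clock and the guessed 256-cycle round trip; illustrative only.
loop_cycles = 256
quads_per_clock = 1
fragments_per_quad = 4

quads_in_flight = loop_cycles * quads_per_clock
print(quads_in_flight * fragments_per_quad)   # -> 1024 fragments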
 
I propose an EST-like encounter session between senior ATI and NV execs and engineers. 12 per side. Only breechcloths may be worn, and no tools/supplies may be taken into the room. An observer will watch/document through one-way glass/intercom. No one will leave the room for any reason until an agreement on common terminology and metrics is reached.

:p
 
andypski said:
So a 'unit' to run a pixel shader program needs ALU and texture resources, so I divide the respective architectures up evenly into chunks that meet the criteria and call them 'shaders'. This results in the following divisions -

R580 - 16 "shaders", each with 1 texture resource and 3 ALU resources
G70 - 24 "shaders", each with 1 texture/ALU resource and 1 dedicated ALU resource.

Ah now I understand your approach. Thanks for the clarification :smile:
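For what it's worth, multiplying out the numbers quoted above (counting the shared texture/ALU unit on G70 as one of its ALU resources, per that division):

Code:
# Resource totals under andypski's division (numbers from the quote above).
chips = {
    "R580": {"shaders": 16, "tex": 1, "alus": 3},
    "G70":  {"shaders": 24, "tex": 1, "alus": 2},  # shared tex/ALU + dedicated ALU
}
for name, c in chips.items():
    print(name, c["shaders"] * c["tex"], "tex,", c["shaders"] * c["alus"], "ALUs")
# -> R580: 16 tex, 48 ALUs; G70: 24 tex, 48 ALUs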
 
So, Xmas, what you're saying is something like this :?: :

Code:
    |
    +-----<----+
    |          |
MAD/Tex Addr   |
    4          |
    |          |
TMU-fetch      ^
    2          |
 TMU-NOP       |
   134        100
TMU-filter     |
    12         |
    |          |
   MAD         |
    4          |
    |          |
    +----->----+
    |
I've put in "latency" (guesses) for each non-FIFO stage, and then put the remaining latency, 100 stages, as the loop-back FIFO.

The TMU pipeline is set at 148 stages to cover typical worst-case latency fetching from GDDR3 (I've guessed deliberately long here). It is essentially a long sequence of NOPs separating "fetch texels" and "filter texels".
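Summing those guesses as a sanity check (all stage counts are the guesses above):

Code:
# Stage-count check for the diagram above (all numbers are guesses).
stages = {
    "MAD/Tex Addr":   4,
    "TMU-fetch":      2,
    "TMU-NOP":        134,
    "TMU-filter":     12,
    "MAD":            4,
    "loop-back FIFO": 100,
}
tmu = stages["TMU-fetch"] + stages["TMU-NOP"] + stages["TMU-filter"]
print(tmu)                   # -> 148-stage TMU pipeline
print(sum(stages.values()))  # -> 256-cycle round trip for a quad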

Jawed
 
Jawed said:
I think that 14-stage DP4 pipeline might actually be so long because it's an SMT architecture - so it could arguably be half that length :smile:

A MADD in Xenon is 12 stages, as compared with 6 in Cell SPE (also at 3.2GHz). Maybe I should have used Cell SPE's vector pipeline for the comparison :oops:

Jawed
The fastest possible DP4 implementation has a latency of about 1.5 to 1.7 times that of the fastest possible MADD implementation (the main issue being that there are some alignment shifts and wide additions that can be avoided for MADD but not DP4).

The Cell architecture uses large amounts of dynamic logic, which gives about 2 to 2.5 times the performance of static logic for any given logic function. Dynamic logic is however extremely process-dependent and horrifically expensive to design (I have seen estimates of about 6 transistors per man-day), which is one important reason that Intel (with about 17x the resources of NV or ATI) has only released a single from-the-ground-up-new x86 core (the "Willamette" Pentium4) in the last 10 years. (If you want to point out PentiumMMX/2/3/M/D/CoreDuo, then don't; all of these are rehashes of previous designs.)

In addition, the company designing Cell also owns the fabs used to manufacture it, and thus has many more opportunities to fine-tune Cell and the process towards each other than NV/ATI will ever have with TSMC/UMC. Cell also uses SOI (TSMC/UMC don't), which also affects performance. I'd add a factor of 1.5 here.

Summing up these factors, if we stay with a pipeline length of 6, then the attainable clock speed for DP4 becomes 3.2GHz/(1.7*2.5*1.5) = 502 MHz. There are also issues with size versus speed tradeoff that I haven't covered here (these are likely to be important because a GPU pipeline may be better able to hide latency than the Cell SPE.)
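As a quick calculation (just multiplying out the estimated factors above):

Code:
# Applying the estimated derating factors to Cell's 3.2 GHz (all estimates).
spe_clock_mhz = 3200
dp4_vs_madd   = 1.7   # DP4 latency vs MADD (upper end of 1.5-1.7x)
dynamic_logic = 2.5   # dynamic vs static logic (upper end of 2-2.5x)
fab_and_soi   = 1.5   # in-house process tuning + SOI vs TSMC/UMC

print(round(spe_clock_mhz / (dp4_vs_madd * dynamic_logic * fab_and_soi)))
# -> 502 MHz for a 6-stage DP4 pipeline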

Also, there is no fundamental reason why SMT would double the latency required for any given instruction; in particular, the other consumer-level SMT implementation (Pentium4) does not exhibit such behavior.
 
arjan de lumens said:
Also, there is no fundamental reason why SMT would double the latency required for any given instruction; in particular, the other consumer-level SMT implementation (Pentium4) does not exhibit such behavior.
It would if the pipeline is designed to interleave two separate threads on alternate clocks.

And P4 isn't SMT.

Jawed
 
Jawed said:
It would if the pipeline is designed to interleave two separate threads on alternate clocks.
Not at all. In such a case, the added requirement is that the pipeline length be a multiple of 2, NOT that it be twice as long. It's not like data in such a pipelined execution unit stops flowing just because the scheduler doesn't keep pushing additional data into it.
And P4 isn't SMT.
How does "Hyperthreading" fail to match the definition of SMT :?:
 
arjan de lumens said:
Summing up these factors, if we stay with a pipeline length of 6, then the attainable clock speed for DP4 becomes 3.2GHz/(1.7*2.5*1.5) = 502 MHz. There are also issues with size versus speed tradeoff that I haven't covered here (these are likely to be important because a GPU pipeline may be better able to hide latency than the Cell SPE.)
Interesting post, thanks.

So what you're saying, in effect, is that 6 clocks of arithmetic in Cell at 3.2GHz translates into ~ the same 6 clocks for the same arithmetic at 500MHz. It's nice to get some detailed thoughts.

Though what's transpired (for me, anyway) is that the NVidia pipeline, at least, could actually take dozens of clocks to perform a MAD or DP4.

I have to admit it's a bit of a wind-up to see that this is the case, but since this architecture has no need to flush the pipeline or re-use the result of a calculation as soon as possible, there's no reason to worry about the pipeline being way longer than a CPU pipeline for the same kind of functionality.

Still it transpires that shader units 1 and 2 are definitely working on different fragments - which is how this diversion got started. Doesn't seem such an interesting detail, now - but it's been good stuff.

Jawed
 
arjan de lumens said:
Not at all. In such a case, the added requirement is that the pipeline length be a multiple of 2, NOT that it be twice as long. It's not like data in such a pipelined execution unit stops flowing just because the scheduler doesn't keep pushing additional data into it.
Well in Xenon, at least, any given pipeline, e.g. vector float, can be issued with an odd-thread calculation and then an even-thread calculation on the next clock.

Each thread in Xenon can be thought of as running on a 1.6GHz CPU.

How does "Hyperthreading" fail to match the definition of SMT :?:
I was thinking of the asymmetry - which I now realise is not right, sigh. Starting off with Xenon being symmetric and going downhill from there :oops:

Jawed
 
Jawed said:
Well in Xenon, at least, any given pipeline, e.g. vector float, can be issued with an odd-thread calculation and then an even-thread calculation on the next clock.

Each thread in Xenon can be thought of as running on a 1.6GHz CPU.
In that case, one 14-step pipeline @3.2 GHz would, to the programmer, look like two more-or-less-independent 7-step pipelines @1.6 GHz. In both cases, the pipeline requires the same amount of physical time - about 4.3 nanoseconds - to complete its calculation. There is no reason why running such a pipeline in non-SMT mode would halve that time.
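In numbers (a trivial check; the per-thread view merely halves both the step count and the clock):

Code:
# Physical latency is the same whether viewed as 14 steps @3.2 GHz
# or, per thread, as 7 steps @1.6 GHz.
print(14 / 3.2)          # ~4.375 ns, whole pipeline at 3.2 GHz
print((14 // 2) / 1.6)   # ~4.375 ns, per-thread view at 1.6 GHz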

That arrangement, where you toggle threads every clock cycle, cycling through threads in a fixed pattern, is also sometimes referred to as a "barrel processor". It's quite easy, but rather odd, to implement 2-way SMT as a 2-way barrel, in that, in the case of e.g. a cache miss, you cannot shift execution resources over to the other thread while the thread with the miss is stalled.
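A toy sketch of that fixed-pattern issue (a hypothetical scheduler, not any real hardware), showing a stalled thread's slots going idle rather than being donated to the other thread:

Code:
# Toy 2-way barrel: issue slots alternate A,B,A,B regardless of stalls.
threads = {"A": ["a0", "a1", "a2"], "B": ["b0", "b1", "b2", "b3"]}
stalled = {"A": {2, 4}}   # pretend thread A is stalled on cycles 2 and 4

for cycle in range(8):
    t = "A" if cycle % 2 == 0 else "B"       # fixed ABAB issue pattern
    if cycle in stalled.get(t, set()) or not threads[t]:
        print(cycle, t, "idle")              # wasted slot; B can't claim it
    else:
        print(cycle, t, threads[t].pop(0))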
 
arjan de lumens said:
In that case, one 14-step pipeline @3.2 GHz would, to the programmer, look like two more-or-less-independent 7-step pipelines @1.6 GHz. In both cases, the pipeline requires the same amount of physical time - about 4.3 nanoseconds - to complete its calculation. There is no reason why running such a pipeline in non-SMT mode would halve that time.
Ultimately I think it's "double-length" because it suits the broader constraints of the Xenon architecture (limited cache, in-order...)

Instead of being greedy and starting the comparison with the 14-stage DP4 in Xenon, I should have kept it simple with the 6-stage SPE MAD. Ah well, never mind :cry:

That arrangement, where you toggle threads every clock cycle, cycling through threads in a fixed pattern, is also sometimes referred to as a "barrel processor". It's quite easy, but rather odd, to implement 2-way SMT as a 2-way barrel, in that, in the case of e.g. a cache miss, you cannot shift execution resources over to the other thread while the thread with the miss is stalled.
In Xenon it's an option - it could issue odd and even clocks from one thread - it's up to the developer to choose what suits their memory access patterns.

Xenon, if it's "set-up" with a pre-fetch (i.e. the dev has to explicitly code for this possibility), can avoid flushing the second thread if the first thread stalls on a miss.

Xenos seems to be more explicitly a barrel processor - seemingly with an AAAABBBB pattern (rather than ABABABAB). But that's not documented :cry:
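For illustration, the two issue patterns side by side (the AAAABBBB pattern for Xenos being, as said, undocumented speculation):

Code:
# Issue-order patterns mentioned above; AAAABBBB for Xenos is speculation.
def issue_order(run_length, cycles=16):
    return "".join("AB"[(c // run_length) % 2] for c in range(cycles))

print(issue_order(1))  # ABABABABABABABAB (Xenon-style fine-grained toggle)
print(issue_order(4))  # AAAABBBBAAAABBBB (possible Xenos pattern)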

Jawed
 
Jawed said:
Ultimately I think it's "double-length" because it suits the broader constraints of the Xenon architecture (limited cache, in-order...)

Instead of being greedy and starting the comparison with the 14-stage DP4 in Xenon, I should have kept it simple with the 6-stage SPE MAD. Ah well, never mind :cry:
While I don't think it is double-length per se, it could well be the case that the Xenon designers felt they could relax the timing a bit from all-out-maximum-possible-performance-at-all-costs, given that the design has 2 threads and that the most common use cases for DP4 usually involve long runs of independent DP4s, reducing the need for extremely-low-latency operation. There could also be issues with logic reuse between DP4 and MADD; such reuse, while saving large amounts of die space, has a tendency to result in circuits that are suboptimal for either task (this is especially true if they tried to reuse the FP addition stages).
In Xenon it's an option - it could issue odd and even clocks from one thread - it's up to the developer to choose what suits their memory access patterns.

Xenon, if it's "set-up" with a pre-fetch (i.e. the dev has to explicitly code for this possibility), can avoid flushing the second thread if the first thread stalls on a miss.
Which would make it a "super-threading" architecture. Having the capability to tune the scheduling algorithm is a bit interesting. But one question: if it is set up with "pre-fetch" and one thread suffers a cache miss, does that thread continue to "execute" an idle-cycle once every 2 cycles for the whole duration of the cache miss?
Xenos seems to be more explicitly a barrel processor - seemingly with an AAAABBBB pattern (rather than ABABABAB). But that's not documented :cry:
AFAIK, non-barrel processing elements aren't very common in GPUs.
 