Clarification over pipelines and stuff...

Neeyik

Homo ergaster
Veteran
Something has been bugging me all day long now and despite some poking around in textbooks and websites, I'm none the wiser. It concerns the pipelines and various units in a graphics chip.

Let's take the example of a GeForce4 (NV25) chip. It has 4 pipelines, with 2 "TMUs" for each one. It also "has" two vertex shader units and a single pixel shader unit. How does this all link together? I was under the impression (which knowing my luck is probably so far off, it's like looking at Pluto through a pair of binoculars) that the vertex shader unit was simply the part of the graphics processing chip that did all the geometry stuff - you'd either program the app using the "standard hardware TnL" rendering pipeline, which would let the drivers handle how the VS unit crunches the data, or you would use vertex shader routines to explicitly control what the hardware is actually doing. Is this right?

If so, how do the dual VS units of the GF4 work with 4 pipelines? I know these operate in parallel so is it a case of when a pipeline has data in the geometry stage, it requests the use of a free VS unit? With something like a GF2 would this mean that each pipeline has to take it in turn to use the one-and-only VS unit?

How do the TMUs and pixel shader unit relate to each other? Is the latter solely used for the mathematics operations in PS routines, such as blending values, etc? How do the TMUs and PS unit work together or are they the same thing?

I need an aspirin...
 
Neeyik said:
If so, how do the dual VS units of the GF4 work with 4 pipelines? I know these operate in parallel so is it a case of when a pipeline has data in the geometry stage, it requests the use of a free VS unit? With something like a GF2 would this mean that each pipeline has to take it in turn to use the one-and-only VS unit?

You should think of the VS pipelines and the rasterization pipelines as different processors working on the same core.
After vertices are transformed they end up in an on-chip buffer. Its purpose is to store transformed vertices temporarily and to hide latency and pipeline bubbles. A primitive setup processor takes vertices from this buffer and builds up primitives (points, lines, triangles..) according to a predefined scheme (lists, fans, strips, indexed strips,...) and sets them up, calculating all the info the rasterizer needs to put the primitive on the screen (edge equations, offsets,....). Then the rasterizer starts to 'walk across' the primitive, extracting pixels to work on according to some rasterization-order scheme that maximizes texture cache hits and minimizes DRAM page crossings. So it could be that a single processor doesn't know anything about what happens in the other processors. Buffers, caches and FIFOs are placed between the processors, linking them and effectively hiding pipeline bubbles and whatever else can stall the pipeline.
Pixel pipelines don't need to make requests to VS pipelines.
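nAo's description amounts to a producer/consumer chain: VS units fill a post-transform buffer, primitive setup assembles triangles out of it, and the rasterizer consumes them, with the buffers as the only coupling between stages. A minimal Python sketch of that decoupling (all names and the dummy "transform" are made up for illustration, not real hardware behaviour):

```python
from collections import deque

def vertex_stage(vertices):
    """Transform vertices (dummy scale stands in for T&L/VS work)
    into the post-transform FIFO."""
    fifo = deque()
    for v in vertices:
        fifo.append((v[0] * 2.0, v[1] * 2.0))
    return fifo

def primitive_setup(fifo):
    """Pull vertices 3 at a time (triangle-list scheme) and assemble
    triangles; real hardware would also compute edge equations here."""
    triangles = []
    while len(fifo) >= 3:
        triangles.append((fifo.popleft(), fifo.popleft(), fifo.popleft()))
    return triangles

def rasterize(triangles):
    """Stand-in rasterizer: just counts the triangles it 'walks'."""
    return len(triangles)

# No stage talks to another directly; each only sees its input buffer.
verts = [(0, 0), (1, 0), (0, 1), (2, 0), (2, 2), (0, 2)]
tris = primitive_setup(vertex_stage(verts))
print(rasterize(tris))  # 6 vertices -> 2 triangles
```

The point of the sketch is only the structure: the rasterizer never asks the vertex stage for anything, it just drains whatever setup has produced.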

How do the TMUs and pixel shader unit relate to each other? Is the latter solely used for the mathematics operations in PS routines, such as blending values, etc? How do the TMUs and PS unit work together or are they the same thing?

I really don't know, but a PS unit should match a single pixel pipeline.
A TMU should just be a block of a PS, a block that provides filtered texels. The PS pipeline itself would operate on data provided by the TMUs, producing the so-called fragments.

ciao,
Marco
 
Thanks for the comments, nAo, but it still doesn't completely solve my query...
After vertices are transformed they end in a on chip buffer.
This I sort of knew about, but my question concerns the processing of the vertices before the data is buffered for setup. With the GF4 example, you've got 4 pixel pipelines and 2 VS units, so do the 4 parallel pipelines share these units between them? (I'll assume from now on that the "hardware TnL unit" is the same piece of silicon as the VS unit.) The reason why I'm persisting over this point is that if a PS unit is per pipeline (as per your suggestion) then why have 2 VS units shared over 4 pipelines when you've got a PS unit for each pipe? Yes, I know they operate independently but I'm just after a concrete understanding of this. ;)
 
With the GF4 example, you've got 4 pixel pipelines and 2 VS units, so do the 4 parallel pipelines share these units between them?

In one sense, yes – but remember, as nAo said before, they are really separate processors.

The 4 pixel pipelines on GF4 don’t really care about the ‘VS’ units themselves, all they care about is the triangle data that’s in the buffer. The Pixel pipelines work on a per triangle basis; they render all the pixels on a triangle before requesting the next triangle to render – the pixel pipes have no concern over where those triangles came from.
 
Neeyik said:
With the GF4 example, you've got 4 pixel pipelines and 2 VS units, so do the 4 parallel pipelines share these units between them?

No, there is no sharing. Like I wrote, PSs don't need to know anything about VSs. PSs just fill primitives; they don't even know what kind of primitive they're filling.
There is no such thing as a VS coupled with a PS. A single VS doesn't even transform all the vertices shared by a single primitive.
VSs are just stream machines and don't know anything about the data and the primitives they're working on.

The reason why I'm persisting over this point is that if a PS unit is per pipeline (as per your suggestion) then why have 2 VS units shared over 4 pipelines when you've got 4 PS units per pipe. Yes I know they operate independently but I'm just after a concrete understanding on this. ;)

Pixel pipes work all together on the same primitive; the hw doesn't need to assign some VSs to some pixel pipe.
While pixel pipes fill a primitive, VSs just crunch vertices, without any knowledge shared between them.

ciao,
Marco
 
Now I'm just plain confused :eek: (pass me a bucket of aspirin)...

What exactly then is the "4 pipelines" in the spec sheet for a GF4/R200 actually referring to? Does this only have any meaning after the geometry unit, i.e. one or two vertex crunchers transform the data, which then undergoes triangle (or whatever) setup, stuffing the results into a buffer which then feeds 4 pixel pipelines (each with a PS unit and TMUs attached)? Or am I barking up the completely wrong tree again?
 
Neeyik said:
Now I'm just plain confused :eek: (pass me a bucket of aspirin)...

What exactly then is the "4 pipelines" in the spec sheet for a GF4/R200 actually referring to? Does this only have any meaning after the geometry unit?

Yeah.. the four pixel pipes 'live' only after the VS and primitive setup engines.
 
If any of you are wondering what that almighty CLANG, that's just reverberated around the globe, was....it was the penny that dropped in my head!

Thanks to nAo and Wavey! I am soooooooo much happier now 8)
 
Just one thing to clarify...
there is no such thing as a "PS unit" that exists independently of the pixel pipelines. In current DX8 chips, each pipeline is PS capable. The pipelines consist of TMUs that fetch and filter texels, and of combiners that do arithmetic operations. If such a pipeline is able to perform a certain set of operations as defined in the DX specs, it can be called a "Pixel Shader pipeline".

Since GF3/4 only have 2 TMUs per pipeline and AFAIK use pipeline combining if more textures are required, one can argue that they only have two fully PS-capable pipelines. But that's a moot point; it's not the number of pipelines but the fill rate that matters.
 
So can the VS be replaced by the CPU?

Just wondering if the VS (no matter whether 1.1 or 2.0) can be fully implemented on the CPU through D3D? If so, then what Trident/SiS are doing really makes sense..
 
Xmas said:
Since GF3/4 only have 2 TMUs per pipeline and AFAIK use pipeline combining if more textures are required, one can argue that they only have two fully PS capable pipelines. But thats a moot point, it's not the number of pipelines but the fill rate that matters.

Actually GF3/4 use loopback instead of pipeline combining.

Every pipeline in a GF3/4 contains 2 texel shader units and 2 register combiner units (as exposed through their OpenGL extensions).
The texel shader is a high-precision vector arithmetic unit + a TMU.
The register combiner is a low-precision vector arithmetic unit.
To implement the PS1.1/1.3 requirements, the GF3/4 uses up to 4x loopback (resulting in 1/4 fillrate) to implement up to 4 texture operations, and up to 8 color/alpha operations.
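If promilo's numbers are right, the fillrate cost follows directly: with 2 texel shaders and 2 register combiners per pipe, the number of loopback passes is set by whichever resource needs more cycles, and fillrate scales as 1/cycles. A back-of-envelope sketch (the model and limits are my reading of the post above, not vendor documentation):

```python
import math

def loopback_cycles(texture_ops, color_ops, units=2, max_loops=4):
    """Loops needed so that 2 texel shaders cover the texture ops and
    2 register combiners cover the color/alpha ops (per promilo's post)."""
    loops = max(math.ceil(texture_ops / units),
                math.ceil(color_ops / units), 1)
    if loops > max_loops:
        raise ValueError("beyond the PS1.1/1.3 limits in this model")
    return loops

print(loopback_cycles(4, 8))  # worst case: 4 cycles -> 1/4 fillrate
print(loopback_cycles(2, 2))  # 1 cycle -> full fillrate
```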
 
Actually GF3/4 use loopback instead of pipeline combining.

There's some contention over this.

Abrash's XBox article stated that NV2A (and hence, we assume, NV20/25) will use loopback for the use of 4 textures; however, an NVIDIA engineer on the old boards said that it would use pipeline combining if a Pixel Shader operation needs 8 register combiners, IIRC.

So, there is a possibility that it can do both, dependent on the operation it's required to do.

To implement the PS1.1/1.3 requirement the GF3/4 card uses up to 4x loopback

No. It can only 'loopback' once (i.e. 2 cycles) - to have 4 cycle loopback would require 8 register combiners per pipe, AFAIK, which would be too much for 4 pipes (ATI managed 6 somehow).
 
DaveBaumann said:
To implement the PS1.1/1.3 requirement the GF3/4 card uses up to 4x loopback

No. It can only 'loopback' once (i.e. 2 cycles) - to have 4 cycle loopback would require 8 register combiners per pipe, AFAIK, which would be too much for 4 pipes (ATI managed 6 somehow).

Why?
It is this loopback capability which allows it to do 8 "register combining operations" with only 2 units per pipe.
 
It is this loopback capability which allows it to do 8 "register combining operations" with only 2 units per pipe.

Well, the question I would put to you there is 'If this is the case why is NV20/25 limited to 4 textures (hence 1 'loopback') per pass?'
 
Oh, I just realised that both of us might be right. :)

nVidia does the texture operations completely separately from the color/alpha operations.

So it has two TMUs, but it can provide loopback (up to 2 cycles) to process up to 4 textures.
Then all the data goes to the register combiners, which can do 2 combiner operations per cycle (supporting up to 4 cycles = 8 operations).
So if you use more than 2 operations with up to two textures, or more than 4 operations with up to four textures, it simply stalls the texture pipeline.
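Hyp-X's reconciliation can be expressed as a tiny cycle model: texture loopback (max 2 cycles for 4 textures) and combiner loopback (max 4 cycles for 8 ops) loop separately, and the slower one stalls the other. A hypothetical sketch of that description, not measured hardware behaviour:

```python
import math

def pixel_cycles(textures, color_ops):
    tex_cycles = math.ceil(textures / 2)    # 2 TMUs per pipe, up to 2x loopback
    comb_cycles = math.ceil(color_ops / 2)  # 2 combiners, up to 4 cycles
    if tex_cycles > 2 or comb_cycles > 4:
        raise ValueError("beyond the PS1.1/1.3 limits in this model")
    # Whichever stage needs more cycles stalls the other.
    return max(tex_cycles, comb_cycles, 1)

print(pixel_cycles(2, 6))  # 3 cycles: the 6 ops stall the texture pipeline
print(pixel_cycles(4, 2))  # 2 cycles: the texture fetches dominate
```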

There are many reasons not to support more textures: more registers are needed. While the execution units are looping, their input/output has to be kept in buffers. More textures means more data has to be buffered for a longer time.

I cannot find the nVidia documentation where I read about the pixel shader speed, but it clearly stated that using 5 or 6 color/alpha instructions results in 1/3 fillrate, which makes pipeline combining very unlikely.
 
Ok, it seems I might be wrong (or the doc I read was).
I wrote a little benchmark program to test the slowdown caused by longer PS programs.
I used this program:
Code:
	ps.1.1
	tex t0

	mov r0, t0
	add r0, r0, r0
	add r0, r0, r0
	add r0, r0, r0
	add r0, r0, r0
	add r0, r0, r0
	add r0, r0, r0
	add r0, r0, r0

I've changed the number of color instructions between 1 and 8. The texture was mapped so all vertices had (0, 0) texture coordinates for maximum cache hits. The Z-buffer was disabled. I even tried disabling color writes, but it didn't make any difference after all this.

Results:
Code:
1 op		792 Mpixel/s
2 ops		792 Mpixel/s
3 ops		397 Mpixel/s
4 ops		331 Mpixel/s
5 ops		198 Mpixel/s
6 ops		198 Mpixel/s
7 ops		149 Mpixel/s
8 ops		149 Mpixel/s

It seems like 1/4 speed for 5-6 operations, rather than 1/3 as the doc said.
I have no idea why I'm unable to get clean results for 4, 7 and 8.
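For what it's worth, dividing the 1-op fillrate by each measured figure gives the implied cycles per pixel, which makes the 1/4-not-1/3 pattern in the table easy to see (measured values copied from the results above):

```python
# Implied cycles per pixel = 1-op fillrate / measured fillrate.
measured = {1: 792, 2: 792, 3: 397, 4: 331,
            5: 198, 6: 198, 7: 149, 8: 149}

implied = {ops: round(measured[1] / mpix, 1) for ops, mpix in measured.items()}
print(implied)
# 5 and 6 ops come out at 4.0 cycles (1/4 speed); a pure 2-ops-per-cycle
# combiner model would predict 3 cycles (1/3 speed) there.
```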

More tests will follow this one...
 