pixel shader pipeline parallelism restriction problem.

991060 · Oct 15, 2003

I've just read a whitepaper named "Xbox Pixel Shader Performance" which you can find in the latest XBOX SDK.
Here are some quotes:

First, on any one clock, the four pixel shader pipelines can only work on pixels belonging to a single triangle. This means that if you were to draw, say, four 1-pixel triangles in a row, it would take a minimum of 4 clocks, and three of the pixel shader pipelines would remain idle during each clock, unable to work in parallel with the one active pixel shader pipeline because there are no other pixels to work on in the current triangle.

The restriction is that the four pixel shader pipelines can draw to only one quad on any given clock. No matter whether 1, 2, 3, or 4 pixels of a given quad are covered by a triangle, that quad will tie up all four pixel shader pipelines while it's being drawn.

The two pixel-shader-pipeline parallelism restrictions can be rolled into a single sentence: The four pixel shader pipelines can draw in parallel only to a single quad of a single triangle.
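As a rough illustration of the two restrictions quoted above (this is not from the whitepaper), here is a toy model in Python. It assumes each quad costs exactly one clock and that a quad is an aligned 2x2 pixel block; the function name `quad_clocks` is made up for this sketch.

```python
def quad_clocks(triangles):
    """Toy model: one clock per (triangle, 2x2 quad) pair.

    `triangles` is a list of sets of (x, y) pixel coordinates, one set
    per triangle. Assumes a one-clock shader and aligned 2x2 quads.
    Returns (clocks, utilization of the four pipelines).
    """
    clocks = 0
    covered = 0
    for pixels in triangles:
        # Each distinct quad touched by this triangle ties up all four pipelines.
        quads = {(x // 2, y // 2) for (x, y) in pixels}
        clocks += len(quads)
        covered += len(pixels)
    utilization = covered / (4 * clocks) if clocks else 0.0
    return clocks, utilization


# Four 1-pixel triangles in a row, as in the whitepaper's example.
tiny = [{(x, 0)} for x in range(4)]
# One triangle that fully covers a single quad.
full = [{(0, 0), (1, 0), (0, 1), (1, 1)}]

print(quad_clocks(tiny))
print(quad_clocks(full))
```

Under these assumptions the four tiny triangles need four clocks with three pipelines idle each clock, while the fully covered quad keeps all four busy.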

I'm wondering whether such restrictions also apply to current PC GPU products, such as NV3X and R3XX.

991060 · Oct 15, 2003

And, if such restrictions do exist, where do they come from?

I think the single-triangle restriction is due to the fact that the PS pipelines need data which is interpolated across the triangle, so shading pixels from two triangles at once doesn't make sense. Please correct me if I'm wrong.

So what about the single-quad restriction? According to the "NV30 inside" article published on 3dcenter.org, NV30 does have such a restriction. What about R300/350? They have 8 pixel shader pipelines, so 2 quads at one time?

Simon F · Oct 15, 2003

991060 said:
And, if such restrictions do exist, where do they come from?

Probably the need to do 2x2 pixels at a time in order to implement the "d/dx" and "d/dy" instructions. (Those aren't the correct names but I'm too lazy to look it up).
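To make that concrete, here is a minimal sketch of how derivatives could be taken by differencing neighbouring pixels in a 2x2 block. The layout of `quad` and the choice of one difference per row and per column are assumptions for illustration; real hardware may compute a single coarser value per quad.

```python
def quad_derivatives(quad):
    """Finite-difference derivatives inside one 2x2 pixel block.

    `quad` is [[top_left, top_right], [bottom_left, bottom_right]],
    holding any per-pixel value (for example a texture coordinate).
    Returns (ddx per row, ddy per column).
    """
    ddx = [quad[0][1] - quad[0][0], quad[1][1] - quad[1][0]]
    ddy = [quad[1][0] - quad[0][0], quad[1][1] - quad[0][1]]
    return ddx, ddy


# u = 0.25 * x + 0.5 * y sampled at the four pixels of a quad.
u = [[0.0, 0.25],
     [0.5, 0.75]]
print(quad_derivatives(u))
```

The point is that every pixel needs its neighbours' values at the same instruction, which is why the four pixels are processed together.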

Hyp-X · Oct 15, 2003

991060 said:
What about R300/350? They have 8 pixel shader pipelines, so 2 quads at one time?

Yes, 2 quads.
The 2 quads are processed independently, can take different processing time or belong to different triangles.
There's a 16x16 tile checkerboard pattern: one of the units processes the "black" tiles, the other the "white" tiles.
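A minimal sketch of such a mapping, assuming 16x16-pixel tiles and that unit 0 owns the tile at the origin (which unit gets which colour is an arbitrary choice here, and `quad_unit_for_pixel` is a made-up name):

```python
TILE = 16  # tile edge in pixels, per the description above


def quad_unit_for_pixel(x, y, tile=TILE):
    """Return 0 or 1: which quad unit owns the tile containing (x, y)."""
    return ((x // tile) + (y // tile)) % 2


for px in [(0, 0), (15, 15), (16, 0), (0, 16), (16, 16)]:
    print(px, quad_unit_for_pixel(*px))
```

Because a given screen pixel always maps to the same unit, two units never write to the same pixel.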

991060 · Oct 15, 2003

Thanks Simon F and Hyp-X, those comments are helpful, though I don't quite understand the "16x16 pattern" thing.

Another question: with the advent of dynamic flow control (DFC), it's quite possible that different pixels within a single quad need different processing times. Is it safe to say that the processing time a quad needs is that of its slowest pixel? If so, I think parallel efficiency drops as more pixel shader pipelines reside in a single GPU, because the probability that all pixels processed at one time need equal processing time decreases very quickly as the number of pipelines grows. Is it possible for IHVs to design their hardware to assign each pixel an independent pipeline rather than assigning 4 pipelines to a quad? How efficient would that approach be?

Dio · Oct 15, 2003

These kinds of things are known as 'granularity losses' (the smallest chunk that can be worked on is larger than the smallest chunk of useful data that could be desired).

It's rarely significant on very small triangles - these tend to be vertex limited. In theory it could be somewhat of a problem on long (100s of pixels) and skinny (~1 pixel) triangles, but I've never actually seen a problem case.

I don't like the terminology used below. 'The four pixel pipelines...' is a misnomer because it implies independence. It is one quad pipeline. (Regular readers will have heard this rant before).

Simon F · Oct 15, 2003

991060 said:
Thanks Simon F and Hyp-X, those comments are helpful, though I don't quite understand the "16x16 pattern" thing.

My guess is that it's probably just ATI's way of avoiding concurrency issues. For example, imagine you have two polys that overlap at pixel P. If you had independent pipeline blocks that could write to any pixel, you wouldn't want a situation where the first poly runs a slow shader on Pipeline A and then later overwrites the value the second poly wrote to P, simply because the second poly happened to run a fast shader on Pipeline B.

991060 said:
Another question: with the advent of dynamic flow control (DFC), it's quite possible that different pixels within a single quad need different processing times. Is it safe to say that the processing time a quad needs is that of its slowest pixel? If so, I think parallel efficiency drops as more pixel shader pipelines reside in a single GPU, because the probability that all pixels processed at one time need equal processing time decreases very quickly as the number of pipelines grows.

Yes this is true in theory, but unlikely to be a problem in practice. If each pixel in the block were to do completely different things, then it's likely to alias like b*ggery.
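For a feel of the numbers, here is a small sketch under deliberately pessimistic assumptions: pixels in a block run in lockstep, and each pixel independently takes a branch with probability `p`. Real neighbouring pixels are usually strongly correlated, which is the point made above. All function names are made up for illustration.

```python
def lockstep_time(pixel_times):
    """A block running in lockstep finishes with its slowest pixel."""
    return max(pixel_times)


def efficiency(pixel_times):
    """Fraction of pipeline-clocks doing useful work in a lockstep block."""
    return sum(pixel_times) / (len(pixel_times) * max(pixel_times))


def prob_all_same_branch(n, p):
    """Chance that n independent pixels all take, or all skip, a branch."""
    return p ** n + (1 - p) ** n


print(lockstep_time([10, 10, 10, 40]), efficiency([10, 10, 10, 40]))
print(prob_all_same_branch(4, 0.5), prob_all_same_branch(8, 0.5))
```

With independent 50/50 branching, only 1 block in 8 stays coherent at 4 pixels and fewer than 1 in 100 at 8 pixels, which matches the worry in the question; the independence assumption is what makes it look so bad.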

Dio said:
I don't like the terminology used below. 'The four pixel pipelines...' is a misnomer because it implies independence. It is one quad pipeline. (Regular readers will have heard this rant before).

Trouble is "quad" is often used for 4 sided polys. Perhaps we should call this a "pixel quartet"?

Nick · Oct 15, 2003

The restriction is most probably needed to calculate texture gradients, like the way dsx/dsy work. These gradients are then used to compute the mipmap level.

This way you can do whatever transformation on the texture coordinates, the mipmap level will still be calculated correctly. This avoids overblur and aliasing that would otherwise occur with 'linear' mipmap level interpolation. That latter method also requires more operations especially when using adaptive anisotropic filtering. Should save some silicon...
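A simplified sketch of turning per-pixel gradients into a mipmap level, assuming normalized texture coordinates, an isotropic filter, and clamping at level 0. This is the textbook log2 form, not any particular chip's implementation, and `mip_level` is a made-up name.

```python
import math


def mip_level(du_dx, dv_dx, du_dy, dv_dy, tex_w, tex_h):
    """Isotropic mip level from screen-space texture coordinate gradients."""
    # Footprint of one pixel step in x and in y, measured in texels.
    len_x = math.hypot(du_dx * tex_w, dv_dx * tex_h)
    len_y = math.hypot(du_dy * tex_w, dv_dy * tex_h)
    rho = max(len_x, len_y)
    if rho <= 0.0:
        return 0.0
    # Clamp at 0: magnification stays on the base level.
    return max(0.0, math.log2(rho))


# 256x256 texture, 4 texels per pixel in both directions.
print(mip_level(4 / 256, 0.0, 0.0, 4 / 256, 256, 256))
```

Since the gradients are measured after whatever the shader did to the coordinates, the level follows the transformed coordinates automatically.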

Dio · Oct 15, 2003

Simon F said:
Trouble is "quad" is often used for 4 sided polys. Perhaps we should call this a "pixel quartet"?

We're stuck with quads, I fear...

If we can't get away from TMU's and pixel pipelines, anything you or I choose to do is not likely to make much difference

991060 · Oct 15, 2003

Thanks for all the replies, I'm beefed up by visiting here

JohnH · Oct 15, 2003

Dio said:
Simon F said:

Trouble is "quad" is often used for 4 sided polys. Perhaps we should call this a "pixel quartet"?

We're stuck with quads, I fear...

If we can't get away from TMU's and pixel pipelines, anything you or I choose to do is not likely to make much difference

Quads could disappear at the point that the majority of apps are heavily using dynamic flow control making dsx/dsy approach to lod calc somewhat less useful than it is today...

John.

Nick · Oct 15, 2003

JohnH said:
Quads could disappear at the point that the majority of apps are heavily using dynamic flow control making dsx/dsy approach to lod calc somewhat less useful than it is today...

Do you know of any efficient alternative?

The only method I know that really makes gradient calculations independent is to fully shade 3 texture coordinates per pixel (arranged in a 1x1 pixel triangle). The cost of this is of course comparable to 3x supersampling but with shader analysis it could be reduced a lot?

JohnH · Oct 16, 2003

Nick said:
JohnH said:

Quads could disappear at the point that the majority of apps are heavily using dynamic flow control making dsx/dsy approach to lod calc somewhat less useful than it is today...

Do you know of any efficient alternative?

The only method I know that really makes gradient calculations independent is to fully shade 3 texture coordinates per pixel (arranged in a 1x1 pixel triangle). The cost of this is of course comparable to 3x supersampling but with shader analysis it could be reduced a lot?

Worst case you need to provide shader code to generate dsx and dsy for your specific shader function, these might be simple e.g. just based on A,B from a plane eqn, but could equally be something rather lardy. The key is that if you do these sort of things the HW isn't going to be able to do the work directly anymore.
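A minimal sketch of the simple case, assuming the texture coordinate follows a plane equation u = A*x + B*y + C and the shader applies a hypothetical function `shade(u) = u*u`. The derivative code is then written by hand via the chain rule instead of being differenced across a quad.

```python
def shade(u):
    """Hypothetical shader function applied to the texture coordinate."""
    return u * u


def analytic_gradient(A, B, C, x, y):
    """Hand-written derivative of shade(u(x, y)) for u = A*x + B*y + C."""
    u = A * x + B * y + C
    # Chain rule: d(u*u)/dx = 2*u*du/dx, and du/dx = A, du/dy = B.
    return 2 * u * A, 2 * u * B


def finite_difference_x(A, B, C, x, y):
    """What differencing against the right-hand neighbour would give."""
    u0 = A * x + B * y + C
    u1 = A * (x + 1) + B * y + C
    return shade(u1) - shade(u0)


print(analytic_gradient(0.25, 0.5, 0.0, 2, 2))
print(finite_difference_x(0.25, 0.5, 0.0, 2, 2))
```

The analytic version needs no neighbouring pixel, but someone has to supply it for every shader function, and for anything less trivial than this the derivative code can be as expensive as the shader itself.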

This is of course the price you pays...

John.

Arun · Oct 16, 2003

Dio said:
Simon F said:

Trouble is "quad" is often used for 4 sided polys. Perhaps we should call this a "pixel quartet"?

We're stuck with quads, I fear...

If we can't get away from TMU's and pixel pipelines, anything you or I choose to do is not likely to make much difference

Hmm... *thinks a bit*
Does that mean R500 is less advanced than NV50?
Or am I just reading too much in that sentence?

Or maybe am I overestimating the NV50? I doubt that though.

Uttar

nelg · Oct 17, 2003

Uttar, I am going to ask that Dave bans you from B3D until you produce that editorial :!:

Dio · Oct 17, 2003

Uttar said:
Dio said:

Simon F said:

Trouble is "quad" is often used for 4 sided polys. Perhaps we should call this a "pixel quartet"?

If we can't get away from TMU's and pixel pipelines, anything you or I choose to do is not likely to make much difference

Or am I just reading too much in that sentence?

I meant 'If we can't escape the archaic uses of terminology like TMU's and pixel pipelines...'

Does that clarify things?

Arun · Oct 17, 2003

nelg said:
Uttar, I am going to ask that Dave bans you from B3D until you produce that editorial

ROFL!

Hey, if I'm being slower than for most of my other writing, it's because I want this to be high quality and very informative.
It's not that I'm not working on it. It's just that there is a lot more "overhead" (NV30, anyone?) than usual.

That includes a professional technical writer correcting it in his free time and sources commenting on it, to make sure I ain't making any major mistakes and that the overall message is indeed correct.

Now, if one of the sources just told me "Err, no, that all this just doesn't seem to be true at all", I'd just scrap the whole thing and start again with other goals. No kidding.

7 full A4 pages in Times New Roman, size 12, written already. Current goal is around 10 pages.

ETA is 20-25 October. Before releasing it, I also need to make sure it's released before or after an official launch, to make sure its existence isn't forgotten due to discussions about I don't know what awfully boring fall refresh

Dio: Lol, okay. I did read too much in that sentence I guess

Uttar
