pixel shader pipeline parallelism restriction problem.

991060

I've just read a whitepaper named "Xbox Pixel Shader Performance", which you can find in the latest Xbox SDK.
Here are some quotes:
First, on any one clock, the four pixel shader pipelines can only work on pixels belonging to a single triangle. This means that if you were to draw, say, four 1-pixel triangles in a row, it would take a minimum of 4 clocks, and three of the pixel shader pipelines would remain idle during each clock, unable to work in parallel with the one active pixel shader pipeline because there are no other pixels to work on in the current triangle.

The restriction is that the four pixel shader pipelines can draw to only one quad on any given clock. No matter whether 1, 2, 3, or 4 pixels of a given quad are covered by a triangle, that quad will tie up all four pixel shader pipelines while it's being drawn.

The two pixel-shader-pipeline parallelism restrictions can be rolled into a single sentence: The four pixel shader pipelines can draw in parallel only to a single quad of a single triangle.
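To make the quoted rule concrete, here is a minimal sketch of a cost model in Python. It assumes quads are screen-aligned 2x2 blocks and that every quad touched by a triangle costs exactly one clock. The names `quad_clocks` and `utilization` are made up for this illustration and do not come from the whitepaper.

```python
def quad_clocks(triangles):
    """Count clocks under the quoted rule: one 2x2 quad of one triangle per clock.

    `triangles` is a list of sets of covered (x, y) pixel coordinates.
    Assumption: quads are aligned to even pixel coordinates.
    """
    clocks = 0
    for covered in triangles:
        # Each distinct quad touched by this triangle costs one clock,
        # however many of its four pixels are actually covered.
        quads = {(x // 2, y // 2) for (x, y) in covered}
        clocks += len(quads)
    return clocks


def utilization(triangles):
    """Fraction of pipeline slots (4 per clock) that shade a covered pixel."""
    pixels = sum(len(covered) for covered in triangles)
    clocks = quad_clocks(triangles)
    return pixels / (4 * clocks) if clocks else 0.0


# Four 1-pixel triangles in a row, as in the whitepaper's example.
one_pixel_tris = [{(x, 0)} for x in range(4)]
# One triangle covering a full, quad-aligned 2x2 block.
full_quad = [{(0, 0), (1, 0), (0, 1), (1, 1)}]
```

Under this model the four 1-pixel triangles take 4 clocks at 25% utilization, while the full quad takes 1 clock at 100%. A span 1 pixel high and 100 pixels long touches 50 quads, so it would run at 50%.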

I'm wondering whether these restrictions also apply to current PC GPUs, such as NV3X and R3XX.
 
And, if such restrictions do exist, where do they come from?

I think the single-triangle restriction is due to the fact that the PS pipelines need data which is interpolated across the triangle, so shading pixels from two triangles at once doesn't make sense. Please correct me if I'm wrong.

So what about the single-quad restriction? According to the "NV30 inside" article published on 3dcenter.org, NV30 does have such a restriction. What about R300/350? They have 8 pixel shader pipelines, so 2 quads at one time?
 
991060 said:
And, if such restrictions do exist, where do they come from?
Probably the need to do 2x2 pixels at a time in order to implement the "d/dx" and "d/dy" instructions. (Those aren't the correct names but I'm too lazy to look it up).
 
991060 said:
What about R300/350? They have 8 pixel shader pipelines, so 2 quads at one time?

Yes, 2 quads.
The 2 quads are processed independently, can take different processing time or belong to different triangles.
There's a checkerboard pattern of 16x16 tiles: one of the units processes the "black" tiles, the other the "white" tiles.
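As a rough illustration of the checkerboard idea, the mapping from a pixel to one of the two quad units could look like the sketch below. The formula and the choice of which unit gets the "black" tiles are assumptions made for this example. The post only states the 16x16 tile size and the two-colour pattern.

```python
TILE = 16  # tile edge in pixels, as stated in the post


def quad_unit_for_pixel(x, y, tile=TILE):
    """Return 0 ("black") or 1 ("white") for the tile containing pixel (x, y).

    Hypothetical mapping: tiles alternate like a chessboard, so the unit is
    the parity of the tile's column index plus its row index.
    """
    return ((x // tile) + (y // tile)) % 2
```

With this mapping, pixels (0, 0) and (15, 15) land on unit 0, the tiles directly to the right and below go to unit 1, and the diagonal neighbour at (16, 16) goes back to unit 0.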
 
Thanks Simon F and Hyp-x, those comments are helpful, though I don't quite understand the "16x16 pattern" thing :D

Another question: with the advent of DFC (dynamic flow control), it's quite possible that different pixels within a single quad will need different processing times. Is it safe to say that the processing time a quad needs is that of its slowest pixel? If so, I think parallelism drops as more pixel shader pipelines are put in a single GPU, because the probability that all the pixels processed at one time need equal processing time decreases very quickly as the pipeline count grows. Is it possible for IHVs to design their hardware to assign each pixel an independent pipeline rather than assigning 4 pipelines to a quad? How efficient would that approach be?
 
These kinds of things are known as 'granularity losses' (the smallest chunk that can be worked on is larger than the smallest chunk of useful data that could be desired).

It's rarely significant on very small triangles - these tend to be vertex limited. In theory it could be somewhat of a problem on long (100s of pixels) and skinny (~1 pixel) triangles, but I've never actually seen a problem case.

I don't like the terminology used below. 'The four pixel pipelines...' is a misnomer because it implies independence. It is one quad pipeline. (Regular readers will have heard this rant before).
 
991060 said:
Thanks Simon F and Hyp-x, those comments are helpful, though I don't quite understand the "16x16 pattern" thing :D

My guess is that it's probably just ATI's way of avoiding concurrency issues. For example, imagine you have two polys that overlap at a pixel, P. If you had independent pipeline blocks that could write to any pixel, you wouldn't want a situation where the first poly runs a slow shader on Pipeline A and then later overwrites P after the second poly has already written it, because the second happened to run a fast shader on Pipeline B.
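The ordering hazard guessed at here can be shown with a toy model. This is only a sketch under strong assumptions: each unit works through its fragments strictly in order, a write lands when its shader finishes, and `final_colors` is a name invented for this example. It says nothing about how ATI's hardware actually schedules work.

```python
def final_colors(fragments, unit_of=None, units=2):
    """Return {pixel: color} after all fragments have been written.

    `fragments` is a list of (pixel, color, shader_cycles) in submission order.
    If `unit_of` is None, each fragment goes to whichever unit is free first.
    Otherwise `unit_of(pixel)` picks the unit, so a given pixel always uses
    the same unit.
    """
    free_at = [0] * units
    writes = []  # (finish_time, submission_index, pixel, color)
    for i, (pixel, color, cycles) in enumerate(fragments):
        if unit_of is None:
            unit = min(range(units), key=lambda k: free_at[k])
        else:
            unit = unit_of(pixel)
        free_at[unit] += cycles
        writes.append((free_at[unit], i, pixel, color))
    framebuffer = {}
    # Writes land in order of completion time; the last one wins.
    for _, _, pixel, color in sorted(writes):
        framebuffer[pixel] = color
    return framebuffer


P = (5, 5)
# First poly runs a slow shader at P, second poly a fast one.
overlap = [(P, "first poly", 100), (P, "second poly", 1)]
```

With free assignment, `final_colors(overlap)` leaves the first poly's colour at P, which is the wrong result. With a fixed mapping such as `unit_of=lambda p: ((p[0] // 16) + (p[1] // 16)) % 2`, both fragments go through the same unit in order and the second poly's colour wins.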

Another question: with the advent of DFC (dynamic flow control), it's quite possible that different pixels within a single quad will need different processing times. Is it safe to say that the processing time a quad needs is that of its slowest pixel? If so, I think parallelism drops as more pixel shader pipelines are put in a single GPU, because the probability that all the pixels processed at one time need equal processing time decreases very quickly as the pipeline count grows.
Yes this is true in theory, but unlikely to be a problem in practice. If each pixel in the block were to do completely different things, then it's likely to alias like b*ggery. :)
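The "slowest pixel" rule and the worry about wider groups can be put into numbers with a small sketch. It assumes each pixel takes the slow branch independently with the same probability. That is a deliberately pessimistic assumption, since neighbouring pixels usually take the same branch. Both function names are invented for this example.

```python
def quad_time(pixel_times):
    """A group processed in lockstep takes as long as its slowest pixel."""
    return max(pixel_times)


def prob_all_same_path(p_slow, n):
    """Probability that n pixels all take the same branch.

    Assumption: each pixel is slow with probability p_slow, independently.
    """
    return p_slow ** n + (1 - p_slow) ** n
```

For example, `quad_time([10, 10, 40, 10])` is 40. With a 50/50 branch, the chance that all pixels agree is 0.125 for a group of 4 and 0.0078125 for a group of 8, which shows how quickly it falls off under the independence assumption.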

Dio said:
I don't like the terminology used below. 'The four pixel pipelines...' is a misnomer because it implies independence. It is one quad pipeline. (Regular readers will have heard this rant before).
Trouble is "quad" is often used for 4 sided polys. Perhaps we should call this a "pixel quartet"?
 
The restriction is most probably needed to calculate texture gradients, like the way dsx/dsy work. These gradients are then used to compute the mipmap level.

This way you can do whatever transformation you like on the texture coordinates and the mipmap level will still be calculated correctly. This avoids the overblur and aliasing that would otherwise occur with 'linear' mipmap level interpolation. The latter method also requires more operations, especially when using adaptive anisotropic filtering. Should save some silicon...
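A minimal sketch of that idea: take finite differences of the texture coordinates across a 2x2 quad and turn them into a mipmap level. The LOD formula used here is the common textbook form (log2 of the longer gradient), not a statement of what any particular chip does, and the coordinates are assumed to be already scaled to texel units. The function names are made up for the example.

```python
import math


def quad_gradients(quad):
    """Finite-difference gradients of (u, v) across a 2x2 quad.

    `quad` maps (dx, dy) in {0, 1} x {0, 1} to a (u, v) coordinate in texels.
    This sketch only uses the top-left pixel and its right and lower
    neighbours.
    """
    u00, v00 = quad[(0, 0)]
    u10, v10 = quad[(1, 0)]
    u01, v01 = quad[(0, 1)]
    dsx = (u10 - u00, v10 - v00)  # change per pixel step in x
    dsy = (u01 - u00, v01 - v00)  # change per pixel step in y
    return dsx, dsy


def mip_level(dsx, dsy):
    """Isotropic LOD estimate: log2 of the longer gradient, clamped at 0."""
    rho = max(math.hypot(*dsx), math.hypot(*dsy))
    return max(0.0, math.log2(rho)) if rho > 0 else 0.0


# Texture coordinates advancing 4 texels per pixel in both directions.
quad = {(0, 0): (0.0, 0.0), (1, 0): (4.0, 0.0),
        (0, 1): (0.0, 4.0), (1, 1): (4.0, 4.0)}
```

For this quad the gradients are (4, 0) and (0, 4), so `mip_level(*quad_gradients(quad))` gives 2.0. Because the differences are taken on the final coordinates, any arithmetic the shader did to produce them is accounted for automatically.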
 
Simon F said:
Trouble is "quad" is often used for 4 sided polys. Perhaps we should call this a "pixel quartet"?
We're stuck with quads, I fear...

If we can't get away from TMU's and pixel pipelines, anything you or I choose to do is not likely to make much difference :)
 
Dio said:
Simon F said:
Trouble is "quad" is often used for 4 sided polys. Perhaps we should call this a "pixel quartet"?
We're stuck with quads, I fear...

If we can't get away from TMU's and pixel pipelines, anything you or I choose to do is not likely to make much difference :)

Quads could disappear at the point that the majority of apps are heavily using dynamic flow control, making the dsx/dsy approach to LOD calculation somewhat less useful than it is today...

John.
 
JohnH said:
Quads could disappear at the point that the majority of apps are heavily using dynamic flow control, making the dsx/dsy approach to LOD calculation somewhat less useful than it is today...
Do you know of any efficient alternative?

The only method I know that really makes gradient calculations independent is to fully shade 3 texture coordinates per pixel (arranged in a 1x1 pixel triangle). The cost of this is of course comparable to 3x supersampling but with shader analysis it could be reduced a lot?
 
Nick said:
JohnH said:
Quads could disappear at the point that the majority of apps are heavily using dynamic flow control, making the dsx/dsy approach to LOD calculation somewhat less useful than it is today...
Do you know of any efficient alternative?

The only method I know that really makes gradient calculations independent is to fully shade 3 texture coordinates per pixel (arranged in a 1x1 pixel triangle). The cost of this is of course comparable to 3x supersampling but with shader analysis it could be reduced a lot?

Worst case, you need to provide shader code to generate dsx and dsy for your specific shader function. These might be simple, e.g. just based on A and B from a plane equation, but could equally be something rather lardy. The key is that if you do these sorts of things, the HW isn't going to be able to do the work directly anymore.
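For the simple case mentioned here, a sketch of the plane-equation approach: if an attribute u is interpolated linearly in screen space as u = A*x + B*y + C, then dsx is A and dsy is B at every pixel, with no neighbouring pixels needed. The code assumes affine interpolation without perspective correction, and `plane_coefficients` is a name invented for the example.

```python
def plane_coefficients(p0, p1, p2):
    """Solve u = A*x + B*y + C through three vertices given as (x, y, u).

    Assumption: the attribute varies linearly in screen space.
    """
    (x0, y0, u0), (x1, y1, u1), (x2, y2, u2) = p0, p1, p2
    # Twice the signed area of the triangle; zero means it is degenerate.
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if det == 0:
        raise ValueError("degenerate triangle")
    # Cramer's rule on the two edge equations.
    a = ((u1 - u0) * (y2 - y0) - (u2 - u0) * (y1 - y0)) / det
    b = ((x1 - x0) * (u2 - u0) - (x2 - x0) * (u1 - u0)) / det
    c = u0 - a * x0 - b * y0
    return a, b, c
```

For vertices (0, 0, 1), (4, 0, 9) and (0, 2, 7) this returns A = 2, B = 3, C = 1, so dsx = 2 and dsy = 3 everywhere in the triangle. Once the shader applies a non-linear function to u, the derivative code has to be written to match that function, which is the "lardy" case.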

This is of course the price you pays...

John.
 
Dio said:
Simon F said:
Trouble is "quad" is often used for 4 sided polys. Perhaps we should call this a "pixel quartet"?
We're stuck with quads, I fear...

If we can't get away from TMU's and pixel pipelines, anything you or I choose to do is not likely to make much difference :)

Hmm... *thinks a bit*
Does that mean R500 is less advanced than NV50?
Or am I just reading too much in that sentence?

Or maybe am I overestimating the NV50? I doubt that though.


Uttar
 
Uttar said:
Dio said:
Simon F said:
Trouble is "quad" is often used for 4 sided polys. Perhaps we should call this a "pixel quartet"?
If we can't get away from TMU's and pixel pipelines, anything you or I choose to do is not likely to make much difference :)
Or am I just reading too much in that sentence?
I meant 'If we can't escape the archaic uses of terminology like TMU's and pixel pipelines...'

Does that clarify things? :)
 
nelg said:
Uttar, I am going to ask that Dave bans you from B3D until you produce that editorial :!: ;)

ROFL! :LOL:

Hey, if I'm being slower than with most of my other writing, it's because I want this to be high quality and very informative.
It's not that I'm not working on it. It's just that there is a lot more "overhead" ( NV30 anyone? ;) ) than usual.

That includes a professional technical writer correcting it in his free time and sources commenting on it, to make sure I ain't making any major mistakes and that the overall message is indeed correct.

Now, if one of the sources just told me "Err, no, all of this just doesn't seem to be true at all", I'd just scrap the whole thing and start again with other goals. No kidding.

7 full A4 pages in Times New Roman, size 12, written already. Current goal is around 10 pages.

ETA is 20-25 October. Before releasing it, I also need to make sure it's released before or after an official launch, to make sure its existence isn't forgotten due to discussions about I don't know what awfully boring fall refresh :p

Dio: Lol, okay. I did read too much in that sentence I guess :)


Uttar
 