Fillrate & Shading power

Bjorn

Veteran
This has probably been discussed here before, but I'm going to ask this question anyway :)

There is some talk about the NV40 and how many pipelines it will have in the Valve Half Life 2 Presentation and Benchmarks thread. And I thought that, as it is now (with the Half Life 2 benchmarks and all), we don't really need more fillrate. Or rather, there are other things that we need much more. Mainly shading power.

What I want to know then is, how deeply connected are the pipelines with the shader units? Do we need more pipelines to get more shading power?

E.g., could you double the shading units of the 9600 (while keeping the pipeline configuration) and get a 9800's shading power/MHz?

And if that's the case, would that be a good solution as far as the transistor budget goes?
 
AFAIK shading power is purely based on fillrate alone.
I think that may be because there is 1 shading unit per pixel pipe? I could be wrong.
 
It depends on how shaders are written, I suppose. I'm thinking of a more "powerful" shader (per pipe) like a 4x2 rather than a 4x1 pixel/texture architecture. Shaders are probably far different from textures in that they're likely not as memory-dependent and require far more passes, so a "4x2" pixel/shader architecture may be better than a "4x1" one pretty much always.

But I thought NV40 will use a unified PS/VS architecture, which may further remove it from the shader-integrated-into-pipeline conception I have of the R3x0 (and R2x0, and NV2x). For instance, will it maintain the "sea of shaders" form that NV3x has, and is that disconnected from the pixel pipelines, fundamentally different than the R3x0?

(I'm talking out of my arse, here. Just filling space before the pros get here. :))
 
Pete said:
I'm thinking of a more "powerful" shader (per pipe) like a 4x2 rather than a 4x1 pixel/texture architecture.

You mean more arithmetic units per pipeline?
It's certainly possible.

If shader programs get longer, there will no longer be a benefit from increasing the peak pixel throughput of the card (sometimes this is referred to as "fillrate").

NV35 does this - the fact it doesn't perform is not a problem with the concept - rather the implementation.

Shaders are probably far different from textures in that they're likely not as memory-dependent and require far more passes, so a "4x2" pixel/shader architecture may be better than a "4x1" one pretty much always.

Surely shaders are different from textures. It's like saying a recipe is different from a tomato. :)

Shaders are programs consisting mainly of two operations: texture lookups and arithmetic ops.
So you're saying that the arithmetic ops / texture lookup ratio will go up?
I agree.
Will this use reduced bandwidth per op?
Not necessarily!
Dependent texture lookups can have far worse texture cache coherency than progressive texture lookups.
Also there are more than 32bit/texel formats available. (At least on ATI cards...)

Oh, and "passes" in 3D rendering doesn't mean what you wanted to say...

But I thought NV40 will use a unified PS/VS architecture, which may further remove it from the shader-integrated-into-pipeline conception I have of the R3x0 (and R2x0, and NV2x). For instance, will it maintain the "sea of shaders" form that NV3x has, and is that disconnected from the pixel pipelines, fundamentally different than the R3x0?

(I'm talking out of my arse, here. Just filling space before the pros get here. :))

It's not the PS/VS integration it's one particular PS3.0 feature that will require a serious redesign of the pixel architecture.
Once that problem is solved (with a high performance solution!) IMHO PS/VS integration will be easy to do.
 
K.I.L.E.R said:
AFAIK shading power is purely based on fillrate alone.
Not exactly. The fillrate of a VPU is determined by the number of pixels (non-textured, single textured, shaded, etc.) that can be written (to one of the many on-chip buffers) per clock. In today's VPU world a "pipeline" is generally in charge of carrying out this task (writing the pixels to be drawn). However, as of late we know that pipelines like those found in NV3x and R3xx not only contain the shader units required for a single vector+scalar op (R3xx) in parallel, but are also deep enough to accommodate other ops in the same clock by pipelining a number of shader units vertically within each pixel pipeline; the units are structured in a conveyor-belt manner to mask the cost of required ops more often.

Try to imagine a set of 4/8 conveyor belts (i.e., the number of pipelines) operated in an assembly line with 1 employee per belt (the ALU used by the pipeline). Now, compare this set to one run by 2 employees per belt. If a required task calls for a one-step procedure, the two work sets (VPUs) will output whatever they're producing at the same rate: 1 item/cycle (1 proxel ;), or shaded pixel/fragment, per cycle). However, if two steps are required for another task, the work set with two employees per belt will maintain its single-step output while the other set will require another pass.

Therefore, shading power is not determined directly by pixel fillrate, but by proxel fillrate, which is defined, by demalion, as the rate of shaded fragments output per clock. Remember that a fragment is not necessarily a completely finished pixel; it could very well be the first shaded layer (similar to a texel in relation to a pixel).
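The assembly-line analogy above can be put in numbers. This is just a toy model of the idea (the function name `proxel_rate` is made up for illustration, not any real VPU's scheduling):

```python
# Toy model of the conveyor-belt analogy: N pipelines, each with K
# arithmetic units stacked vertically. A fragment needing `ops`
# arithmetic steps occupies its pipeline for ceil(ops / K) passes.
from math import ceil

def proxel_rate(pipelines, alus_per_pipe, ops_per_fragment):
    """Shaded fragments ("proxels") output per clock, idealized."""
    passes = ceil(ops_per_fragment / alus_per_pipe)
    return pipelines / passes

# One-step task: one or two "employees" per belt makes no difference.
assert proxel_rate(4, 1, 1) == proxel_rate(4, 2, 1) == 4
# Two-step task: the two-employee belts keep their rate, the others halve.
assert proxel_rate(4, 2, 2) == 4
assert proxel_rate(4, 1, 2) == 2
```

So pixel fillrate (the `pipelines` term) and proxel fillrate (the whole expression) only coincide when shaders fit in one pass.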
 
I don't think adding more units in a serial fashion has any future.

Think Parhelia (4x4 design). In single textured (bilinear) stuff, 75% of the fragment resources just sit idle and do nothing. Wasted transistors.
The most efficient use of transistors is to provide 'little' units that can have a lot of loopback if required, because you can easily achieve 100% utilization of all the resources you've spent in the first place.
There's also a very real desire to make trivial shaders as fast as they can be. Witness ATI's transition from 2x3 to 4x2 to 8x1. Or the 'zixel' optimization in NV3x. Or take the IHVs' reluctance to adopt single cycle trilinear capable texture samplers. As long as simple tasks make up a good deal of the work required for rendering a scene, it'll be beneficial to make them fast (as long as it doesn't impede execution speed on more complex tasks - in general, it doesn't).

This does not immediately lead to fast operation on complex shaders, but it does increase average bang per transistor. And the transistors you've just saved can then be spent elsewhere.
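The utilization argument above can also be sketched as a toy calculation (the `utilization` helper is invented here; it ignores caches, latency hiding, and everything else real hardware does):

```python
from math import ceil

def utilization(tmus_per_pipe, textures_per_fragment):
    """Fraction of a pipe's texture samplers doing useful work on a
    fragment that needs `textures_per_fragment` lookups (toy model)."""
    cycles = ceil(textures_per_fragment / tmus_per_pipe)
    return textures_per_fragment / (cycles * tmus_per_pipe)

# Single bilinear texture on a 4-sampler (Parhelia-style) pipe:
# 75% of the fragment resources sit idle.
assert utilization(4, 1) == 0.25
# A 1-sampler pipe with loopback stays fully busy on any texture count.
assert all(utilization(1, n) == 1.0 for n in (1, 2, 3, 4))
```

The loopback design wins on average utilization exactly because simple tasks still dominate real workloads.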
 
If shader programs get longer, there will no longer be a benefit from increasing the peak pixel throughput of the card (sometimes this is referred to as "fillrate").

Aren't we already at this point?

Because we know that Half Life 2, a first-gen DX9 game, is almost exclusively shader limited as far as the graphics pipeline goes.
And I'm guessing that this will only get worse.
 
zeckensack said:
I don't think adding more units in a serial fashion has any future.

No it does not, but it doesn't have to be in a serial fashion.

Even an RV350 surely works on many more than 4 pixels at a time.
You could double the units, and double the number of pixels in flight, for double performance.
That wouldn't double the pixel throughput, so it would be cheaper than what the R300 is.

The reason for ATI to choose a design with 8 pixels/clock throughput for the R300 was probably that they wanted to ensure the highest possible score in "legacy" applications.
 
Hyp-X said:
The reason for ATI to choose a design with 8 pixels/clock throughput for the R300 was probably that they wanted to ensure the highest possible score in "legacy" applications.

Do you believe the R300's 8 pixel pipes work together on the same triangle? I don't...

ciao,
Marco
 
For people wondering what proxels are, click the not-so-secret "handshake" for proxel discussion: :p.

Short version: a "modern pixel" that allows familiar terms such as fillrate to retain usefulness for games growing more dependent on shader processing.

...

The analysis and standardization would be the complicated part. The elements of the "proxel" idea are meant to let the results of that communicate something meaningful to consumers in a more familiar "fillrate" (and "pipeline", if you dig deeper into my proxel discussion) context (pixel->texel->proxel).

The idea was for a common frame of reference for multiple shader benchmarks that were very focused on examining one characteristic of the GPU...like pixel and texel fillrate tests, except with a bit more information conveyed. "Min" and "Max" fillrates would be relatively fixed, while "Standardized" would be bracketed between them, and where compiler improvements should manifest.

Short version: Click.

...

Of course, cheating can distort this picture, as it distorts results for the individual benchmarks used, but building a common, benchmark-independent framework for each benchmark to contribute to, while making that more accessible to consumers for contrast and comparison, can make cheating harder to hide.
 
nAo said:
Do you believe the R300's 8 pixel pipes work together on the same triangle? I don't...

Yes, just not on the same 16x16 screen tile.
The screen is divided into tiles, half of which belong to the 1st unit (4 pipelines) and half of which belong to the 2nd unit.
The rasterizer dispatches 4-pixel blocks to the 2 units, which execute completely separately.
The execution time can differ between the units (because of texturing delays), so it's possible that one starts to process pixels from the next triangle while the other unit is working on the previous one, but this is not the common case.

Of course all this is guesswork, but there are clues that point in this way...
 
Hyp-X said:
The execution time can differ between the units (because of texturing delays), so it's possible that one starts to process pixels from the next triangle while the other unit is working on the previous one, but this is not the common case.
I hope this IS the common case :)
With tiles this big, if the two 2x2-quad units can't fill different primitives at the same time, most of the time one will sit idle. Just think of a scene with a lot of small triangles..

ciao,
Marco
 
nAo said:
Just think of a scene with a lot of small triangles..

You are right, in this case it will happen fairly often.

My point was that the work is distributed based on screen location and not by triangles. The alternative would be: dispatch triangle 1 to unit 1, dispatch triangle 2 to unit 2, dispatch triangle 3 to the unit that finishes first, etc.
Of course you didn't say that, but it sounded like that.
 
Hyp-X said:
You mean more arithmetic units per pipeline?
It's certainly possible.
Indeed I do. In the future, I'll either have to be less lazy when typing, or you'll have to develop your latent psychic abilities so you can read my mind. ;)

If shader programs get longer, there will no longer be a benefit from increasing the peak pixel throughput of the card (sometimes this is referred to as "fillrate").

[...] Surely shaders are different from textures. It's like saying a recipe is different from a tomato. :)

Shaders are programs consisting mainly of two operations: texture lookups and arithmetic ops.
So you're saying that the arithmetic ops / texture lookup ratio will go up?
:D

Yes, that was my point. I see you've got the psychic thing working already. :)
I agree.
Will this use reduced bandwidth per op?
Not necessarily!
Dependent texture lookups can have far worse texture cache coherency than progressive texture lookups.
Also there are more than 32bit/texel formats available. (At least on ATI cards...)
Interesting, and point taken.

Oh, and passes in 3d rendering doesn't mean what you wanted to say...
Yes, I meant clock. I was thinking games don't really have more than a handful of textures per pixel, so a 4x2 pixel/texture architecture can be significantly less effective than an 8x1 one if a lot of pixels have an odd number of textures. OTOH, I'm guessing shaders require more than a few cycles to run, so the distinction between fewer but more powerful shaders and more, less-efficient-per-clock shaders may be smaller than that between a 4x2 vs. 8x1 architecture. I'm probably oversimplifying to the point of meaningless reduction, though.
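The odd-texture-count point can be checked with simple cycle counting. A toy model (the `fill_cycles` function and its parameters are made up for this sketch, and it assumes perfect scheduling with no bandwidth limits):

```python
from math import ceil

def fill_cycles(pipes, tmus_per_pipe, textures, fragments):
    """Cycles to texture `fragments` fragments, each sampling
    `textures` textures, on a pipes x tmus design (toy model)."""
    loops_per_fragment = ceil(textures / tmus_per_pipe)
    return loops_per_fragment * ceil(fragments / pipes)

# Even texture count: 4x2 and 8x1 tie.
assert fill_cycles(4, 2, 2, 1024) == fill_cycles(8, 1, 2, 1024) == 256
# Odd texture count: 8x1 wins, because the 4x2's second TMU idles
# on the leftover texture.
assert fill_cycles(8, 1, 3, 1024) < fill_cycles(4, 2, 3, 1024)
```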

It's not the PS/VS integration it's one particular PS3.0 feature that will require a serious redesign of the pixel architecture.
Once that problem is solved (with a high performance solution!) IMHO PS/VS integration will be easy to do.
I'm not very familiar with CPU architecture, so if that one feature of PS3.0 is related to branching or looping or both, further elaboration on the hardware implementation of that would probably take too much time to clarify. That's OK, as I've learned enough from this thread so far.

Thanks for taking the time to correct me, Hyp-X.
 
Hyp-X said:
The alternative would be: dispatch triangle 1 to unit 1, dispatch triangle 2 to unit 2, dispatch triangle 3 to the unit that finishes first, etc.
Of course you didn't say that but it sounded like that.
That isn't the only alternative.
When I wrote that each unit rasterizes a different triangle, I was thinking about non-overlapping triangles. Obviously, with this per-tile approach the units can render non-overlapping portions of the same triangle, or different triangles.
Unfortunately I don't know how the R300+ architecture works in this regard.
At last year's Graphics Hardware in Germany I asked an ex-ArtX guy (now at ATI) about that, and he politely told me he couldn't share any info on this topic :(

ciao,
Marco
 
Oh, I forgot to say that the 16x16 tiles are assigned to the units in a checkerboard pattern.
I just wrote a little test program to prove my theory.
If I arrange stuff to only use half of the checkerboard I get half fillrate/shader speed.
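The checkerboard assignment described above could be sketched like this. This is pure guesswork mirroring the post (the `TILE` constant and `unit_for_pixel` function are invented names, not anything documented by ATI):

```python
# Guessed R300 work split: 16x16 screen tiles assigned to two
# quad-pipeline units in a checkerboard pattern.
TILE = 16

def unit_for_pixel(x, y):
    """Which of the two units (0 or 1) owns the tile containing (x, y)."""
    return ((x // TILE) + (y // TILE)) % 2

# Horizontally and vertically adjacent tiles alternate units...
assert unit_for_pixel(0, 0) != unit_for_pixel(16, 0)
assert unit_for_pixel(0, 0) != unit_for_pixel(0, 16)
# ...so geometry confined to one checkerboard colour hits one unit only,
# which would show up as half the fillrate/shader speed, as the test found.
assert unit_for_pixel(0, 0) == unit_for_pixel(16, 16)
```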
 
Wasn't this covered in the R3*0 new info thread?

Full-fledged ALUs have more features/functions; consequently, upgrading the mini-ALU to full status will probably end up doubling the performance of ALU-bound items in all cases, not just the current set.

:?: :?:
 