NV40: 6x2/12x1/8x2/16x1? Meh. Summary of what I believe

Dio said:
Note that alpha test forces Z-features off, which makes chips take a big hit too.
Huh? There's no reason to turn off Z optimizations when using alpha test.
 
DaveBaumann said:
What about the flexibility? Well, the flexibility of outputting twice the number of pixels when doing bilinear filtering won't exist if the memory bus isn't wide enough.

Yes, but what if the next triangle to be operated on had a 100+ instruction shader program? If you have a sufficiently large buffer you can still output the bilinear-filtered pixels from the buffer with no overall loss of performance, because you have that backlog of pixels to drain.

Well, in this specific case the extra bilinear fill rate will mostly be negligible unless you have both a huge bilinear rendering speed and a huge buffer (huge meaning more than twice the usual size). Moreover, at the "result level" it'll look like X improved pipelines and not like 2X simple pipelines.

DaveBaumann said:
The question is to assess where the largest bottlenecks are over a wider range of current and future titles.

Right. Actually we can generalize this as "The question is about the usefulness of features." This includes the marketing utility: if you have 16 pipelines and a 256-bit bus running at the GPU frequency, you'll never be able to show that your GPU is a 16-pipeline one without using massive cheats.

DaveBaumann said:
Also, IIRC, the way R300 allocates quads means that it can frequently be operating on different triangles for each quad pipeline, and these may well have different requirements from each other.

I agree with you on this, but that doesn't mean that being able to output more pixels than the memory bus can absorb is a good idea.
 
Tridam said:
Well, in this specific case the extra bilinear fill rate will mostly be negligible unless you have both a huge bilinear rendering speed and a huge buffer (huge meaning more than twice the usual size). Moreover, at the "result level" it'll look like X improved pipelines and not like 2X simple pipelines.

Tridam said:
This includes the marketing utility: if you have 16 pipelines and a 256-bit bus running at the GPU frequency, you'll never be able to show that your GPU is a 16-pipeline one without using massive cheats.

And this is the exact situation that R300 is in already – it can't reach its "bilinear fill-rate" potential, as shown by 3DMark's fill-rate tests, because there is insufficient bandwidth for it to do so. However, this scenario isn't the common case now and will become less so in the future.

Adding the extra pipelines in the fashion that R300 does also adds an equivalent shader/texture performance increase – the advantage of increasing the pipelines doesn't come from showing "I have 8 / 16 / 24 pipelines of performance", it comes because the overall performance over a wide range of situations is increased whilst maintaining an easy level of control over how to keep those pipelines busy.

Overall, in many cases the performance (clock for clock) of R300 and NV35 is similar, but they take different approaches to getting there. Similar increases in performance in shader-dominated situations can be obtained by deepening the pipeline, but then that requires more control logic (software and/or hardware) in order for it to be scheduled correctly and utilized efficiently.
 
DaveBaumann said:
Tridam said:
Well, in this specific case the extra bilinear fill rate will mostly be negligible unless you have both a huge bilinear rendering speed and a huge buffer (huge meaning more than twice the usual size). Moreover, at the "result level" it'll look like X improved pipelines and not like 2X simple pipelines.

Tridam said:
This includes the marketing utility: if you have 16 pipelines and a 256-bit bus running at the GPU frequency, you'll never be able to show that your GPU is a 16-pipeline one without using massive cheats.

And this is the exact situation that R300 is in already – it can't reach its "bilinear fill-rate" potential, as shown by 3DMark's fill-rate tests, because there is insufficient bandwidth for it to do so. However, this scenario isn't the common case now and will become less so in the future.

It can't reach its max bilinear potential, but it can benefit from it. With everything the same but 16 pipes, the GPU wouldn't have been so well balanced. Instead of using 16 pipes, it's more interesting to use the extra transistors of the 8 additional pipes to improve the 8 existing pipes.

DaveBaumann said:
Adding the extra pipelines in the fashion that R300 does also adds an equivalent shader/texture performance increase – the advantage of increasing the pipelines doesn't come from showing "I have 8 / 16 / 24 pipelines of performance", it comes because the overall performance over a wide range of situations is increased whilst maintaining an easy level of control over how to keep those pipelines busy.

Of course you're right and of course I think the same.

Using 8 R300 pipes instead of 4 more complex pipes was a nice decision, because the 8 pipes can be used to do 8 simple things at the same time. For complex things, ATI could have reached the same level of performance with a few fewer transistors and more complex pipelines.

DaveBaumann said:
Overall, in many cases the performance (clock for clock) of R300 and NV35 is similar, but they take different approaches to getting there. Similar increases in performance in shader-dominated situations can be obtained by deepening the pipeline, but then that requires more control logic (software and/or hardware) in order for it to be scheduled correctly and utilized efficiently.
It requires more control logic, but IMHO less than the control logic required by the extra pipelines. I think that comparing the NV35 and R300 approaches could be a trap: NVIDIA's pipelines are not only more complex than ATI's, they're also very different. I'm sure that you can have pipelines as deep as NV35's and as efficient as R3x0's using a few fewer transistors than twice the R3x0 pipelines would require.
 
Tridam said:
It can't reach its max bilinear potential, but it can benefit from it.

But that's one very small specific case – every other rendering case for R300 will require a minimum of 2 clocks per pixel. How many times will that occur in future titles? How many times will it occur with each quad operating on portions of triangles that have the same requirements?

Tridam said:
With everything the same but 16 pipes, the GPU wouldn't have been so well balanced. Instead of using 16 pipes, it's more interesting to use the extra transistors of the 8 additional pipes to improve the 8 existing pipes.

In which case your instruction schedulers and shader optimizers all have to be recoded to allow for the new pipeline organization.

Tridam said:
It requires more control logic, but IMHO less than the control logic required by the extra pipelines.

Well, I think the actual control logic for what to allocate to which quad is intensely simple and it'll probably go by the tiles of the Hier-Z. There may be duplication for instruction buffers and texture caches, etc.

Tridam said:
I'm sure that you can have pipelines as deep as NV35's and as efficient as R3x0's using a few fewer transistors than twice the R3x0 pipelines would require.

But does that fit as neatly with the rest of the range? Look at the differences between NV30 and NV31 (or NV35 and NV36) and compare that to R300 and RV350.
 
DaveBaumann said:
Tridam said:
It can't reach its max bilinear potential, but it can benefit from it.

But that's one very small specific case – every other rendering case for R300 will require a minimum of 2 clocks per pixel. How many times will that occur in future titles? How many times will it occur with each quad operating on portions of triangles that have the same requirements?

I don't really understand your point here. You say that few pixels will render at one clock per pixel, but you're against my view that adding more x1 pipelines isn't the best idea?

Don't forget that I wasn't talking about ATI's current design specifically, and that the bilinear example is just an example. Actually I was talking about the fact that having more output than the memory bus can absorb isn't the best way to use transistors. You can think about more complex pixels with alpha blending instead of simple bilinear pixels.

I just agree with the fact that the bandwidth required per pixel and per cycle will go down, but not with the idea that IHVs can already forget about simpler pixels requiring more bandwidth per cycle.

DaveBaumann said:
Tridam said:
I'm sure that you can have pipelines as deep as NV35's and as efficient as R3x0's using a few fewer transistors than twice the R3x0 pipelines would require.

But does that fit as neatly with the rest of the range? Look at the differences between NV30 and NV31 (or NV35 and NV36) and compare that to R300 and RV350.

Something that is true between 4 and 8 pipes could be false between 8 and 16 pipes. But anyway, I don't understand your point here.
 
Xmas said:
Dio said:
Note that alpha test forces Z-features off, which makes chips take a big hit too.
Huh? There's no reason to turn off Z optimizations when using alpha test.
Some can still function. However, you can't do Z writes before shading, which either implies performing Z tests post-shader, or having a very long active Z lifetime, which in turn implies big caches and complex logic to handle overlapping Z writes.
 
What about a card with frame buffer compression? In that case you could output more pixels without needing a bigger bus.
 
Dio said:
Xmas said:
Dio said:
Note that alpha test forces Z-features off, which makes chips take a big hit too.
Huh? There's no reason to turn off Z optimizations when using alpha test.
Some can still function. However, you can't do Z writes before shading, which either implies performing Z tests post-shader, or having a very long active Z lifetime, which in turn implies big caches and complex logic to handle overlapping Z writes.
Well, yes, early per-pixel Z-tests might be difficult, but a hierarchical Z-buffer with conservative values should still work as expected.
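To make the ordering issue concrete, here's a minimal C sketch of a hypothetical per-fragment flow (my own illustration, not modeled on any actual chip) showing why a conservative hierarchical-Z reject stays safe while an early Z write does not:

#include <stdio.h>

/* One fragment: depth from the rasterizer, alpha from the shader. */
typedef struct { float z; float alpha; } Fragment;

int process_fragment(Fragment f, float *depth, float tile_max_z,
                     float alpha_ref, int alpha_test)
{
    /* Hierarchical-Z reject: only ever *rejects*, never writes Z,
     * so it stays valid when alpha test is enabled. */
    if (f.z > tile_max_z)
        return 0;

    /* An early Z *write* here would be wrong: the alpha test below
     * may still kill the fragment after shading, leaving a bogus
     * depth value behind. So the Z update has to wait. */

    /* ...pixel shader runs here and produces f.alpha... */

    if (alpha_test && f.alpha < alpha_ref)
        return 0;                 /* killed post-shader: no Z write */

    if (f.z <= *depth) {          /* full-precision Z test */
        *depth = f.z;             /* deferred Z write */
        return 1;
    }
    return 0;
}

int main(void)
{
    float depth = 1.0f, tile_max = 1.0f;
    Fragment f = { 0.5f, 0.3f };  /* shaded alpha below the 0.5 ref */
    printf("survives: %d\n", process_fragment(f, &depth, tile_max, 0.5f, 1));
    printf("depth untouched: %f\n", depth);  /* still 1.0 */
    return 0;
}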
 
nobie said:
What about a card with frame buffer compression? In that case you could output more pixels without needing a bigger bus.
Frame buffer compression is only really feasible when you have lots of identical color values, like with multisampling, where all samples of a non-edge pixel get the same color.
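As a toy illustration of that point (my own sketch, not any real hardware format), lossless compression only wins when samples repeat, which multisampling guarantees for interior pixels:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Toy lossless compression of one 4x-multisampled pixel: when all
 * samples share a color (the common non-edge case), store it once;
 * otherwise store all four samples with no savings. */
size_t pack_pixel(const uint32_t samples[4], uint32_t out[4])
{
    if (samples[0] == samples[1] &&
        samples[0] == samples[2] &&
        samples[0] == samples[3]) {
        out[0] = samples[0];
        return 1;                 /* 4:1 on non-edge pixels */
    }
    for (int i = 0; i < 4; i++)   /* edge pixel: colors differ */
        out[i] = samples[i];
    return 4;
}

int main(void)
{
    uint32_t interior[4] = { 0xffcc8800, 0xffcc8800, 0xffcc8800, 0xffcc8800 };
    uint32_t edge[4]     = { 0xffcc8800, 0xff000000, 0xffcc8800, 0xff000000 };
    uint32_t buf[4];
    printf("interior pixel: %zu word(s)\n", pack_pixel(interior, buf));
    printf("edge pixel:     %zu word(s)\n", pack_pixel(edge, buf));
    return 0;
}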
 
Tridam said:
Something that is true between 4 and 8 pipes could be false between 8 and 16 pipes. But anyway, I don't understand your point here.

There's a case for suggesting that potentially "wasting" a little in terms of output provides other benefits as well - i.e. ease of portability to lower-end hardware. In the examples I cited the development must have been quite different - RV350 just about has one quad raster pipeline lifted from the architecture intact, minimising the development, but with a deeper pipeline this may not be as easy, as is the case with NV31, which needed a new pipeline organisation in comparison to NV30.
 
Uttar... I'm not really sure how it's implemented now, but this is what you'd have to do:

Take 3 sample vectors
s1 = (v1, x1, y1), s2 = (v2, x2, y2), s3 = (v3, x3, y3)
x = screen x-coordinate
y = screen y-coordinate
v = value of some pixel shader register

then your plane equation is

C + s.((s2-s1)^(s3-s1)) = 0

(where "." is the dot product, "^" the cross product, and C = -s1.((s2-s1)^(s3-s1)) so the plane passes through the samples).

Plug in s = s1 + (dv/dx, 1, 0) and s = s1 + (dv/dy, 0, 1), solve for the derivatives, and you get

F = 1 / ((x2-x1)(y3-y1) - (y2-y1)(x3-x1))
dv/dx = F * ( (v2 - v1)(y3 - y1) - (y2 - y1)(v3 - v1) )
dv/dy = F * ( (x2 - x1)(v3 - v1) - (v2 - v1)(x3 - x1) )

let

C0 = F*(y2 - y3)
C1 = F*(y3 - y1)
C2 = F*(y1 - y2)

then dv/dx = C0*v1 + C1*v2 + C2*v3 (similarly for dv/dy)

So, besides the precomputation of 3 constant factors, the ddx instruction for a scalar register corresponds to a 3-component dot product. If the register is a vec4, it's a multiplication of the 3-vector (C0, C1, C2) by a 3x4 matrix.

Besides the fact that registers from different pixels are involved, this is a perfect fit for the PS math units.
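For what it's worth, here's the scalar case of that recipe as a small self-contained C program (my own sketch of the math above, not of how any chip actually wires it up):

#include <stdio.h>

/* One sample of a shader value v at screen position (x, y). */
typedef struct { float v, x, y; } Sample;

float ddx(Sample s1, Sample s2, Sample s3)
{
    /* F = 1 / ((x2-x1)(y3-y1) - (y2-y1)(x3-x1)), as in the post */
    float F = 1.0f / ((s2.x - s1.x) * (s3.y - s1.y)
                    - (s2.y - s1.y) * (s3.x - s1.x));

    /* the three constant factors; they depend only on sample
     * positions, so hardware could precompute them per triangle */
    float C0 = F * (s2.y - s3.y);
    float C1 = F * (s3.y - s1.y);
    float C2 = F * (s1.y - s2.y);

    /* the per-register work is just a 3-component dot product */
    return C0 * s1.v + C1 * s2.v + C2 * s3.v;
}

int main(void)
{
    /* v = 2x + 3y + 1 sampled at three corners of a quad, so the
     * computed ddx should come out as exactly 2.0 */
    Sample s1 = { 1.0f, 0.0f, 0.0f };
    Sample s2 = { 3.0f, 1.0f, 0.0f };
    Sample s3 = { 4.0f, 0.0f, 1.0f };
    printf("ddx = %f\n", ddx(s1, s2, s3));
    return 0;
}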
 
So, besides the precomputation of 3 constant factors, the ddx instruction for a scalar register corresponds to a 3-component dot product. If the register is a vec4, it's a multiplication of the 3-vector (C0, C1, C2) by a 3x4 matrix.

What the heck does "besides" mean?
 
reever said:
So, besides the precomputation of 3 constant factors, the ddx instruction for a scalar register corresponds to a 3-component dot product. If the register is a vec4, it's a multiplication of the 3-vector (C0, C1, C2) by a 3x4 matrix.
What the heck does "besides" mean?
"Apart from", "other than".
 
Pete said:
Saw an ad for Samsung GDDR3 proclaiming 51GB/s bandwidth smack dab in the middle of AT's front page. Now that would be something. Pity we'll have to "settle" for 600MHz. :)

"It’s no wonder that Samsung’s GDDR3 Memory is used by leading graphics controller companies to optimize performance."

You'll notice "is" as opposed to "will be". Wazzup with that?
 
And, as far as that goes, why are the part numbers and specs for GDDR2 and GDDR3 parts the same?
 
This is an amusing thread...;) I love the title:

"NV40: 6x2/12x1/8x2/16x1? Meh. Summary of what I believe"

You've done it again, Uttar...:)

Among the more amusing of my observations in the thread:

*I'm not sure what the number of instructions in a shader chain has to do with the number of pixel pipes in a gpu. It seems to me that whether there's 1 instruction in the chain or 10,000, the number of physical pixel pipes in the gpu is fixed at a static, absolute number, and describes the maximum number of pixels per clock the gpu may render to screen under any conditions. Unlike shader instruction chains, which are software, the number of pixel pipelines in the gpu is a physical property of the gpu, quite fixed and absolute, and quite distinct from software.

*Hence, how is it that "6x2/12x1/8x2/16x1" might all be considered descriptions of the same gpu? For instance, I cannot see how a gpu might be described as "6x2" and "12x1" at the same time, or else simultaneously be both an "8x2" and a "16x1" gpu. The term "6x2" breaks down as follows:

The first number, the "6," tells us how many pixel pipelines are in the gpu, and the second number, the "2" in this case, tells us how many texturing units are attached to each of those 6 pixel pipelines. So "6x2" tells us the gpu has 6 pixel pipelines, no more and no less, and that each of those pixel pipelines has two texturing units attached to it. In total, then, a 6-pixel-pipeline gpu may generate at most 6 pixels per clock, with 0, 1, or 2 texels attached to each pixel rendered per clock.

Therefore, it is physically impossible for a "6x2" gpu to ever be "12x1", since in the first case the gpu has 6 pixel pipelines and in the second case it has 12, and a gpu cannot be both, obviously.

However, if we assume that the gpu pixel pipeline organization is actually 12x1, which is to say it physically has 12 pixel pipelines to begin with, each with 1 texturing unit attached, then such a 12-pixel-pipeline gpu could function thusly per clock: 12x1/12x0/6x2 (in the last case, 6 of the 12 pixel pipelines each render 1 pixel with 1 texel attached, while the other 6 pipelines contribute only a texel each by way of their attached texturing units, so that 6 pixels per clock, with 2 texels attached to each, are rendered, for 6x2).
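To attach some toy numbers to that (my own illustrative arithmetic; the 400 MHz clock is made up):

#include <stdio.h>

/* Illustrative fill-rate arithmetic for the 12x1 example above,
 * at a hypothetical 400 MHz core clock. */
int main(void)
{
    const double clock_mhz = 400.0;   /* made-up clock for illustration */
    const int pipes = 12, tmus_per_pipe = 1;

    /* single-texturing: every pipe outputs one textured pixel per clock */
    double px_12x1 = pipes * clock_mhz;                 /* Mpixels/s */

    /* dual-texturing on the same chip: pipes pair up, halving pixel
     * output while total texel throughput stays the same (6x2 mode) */
    double px_6x2  = (pipes / 2) * clock_mhz;           /* Mpixels/s */
    double texels  = pipes * tmus_per_pipe * clock_mhz; /* Mtexels/s */

    printf("12x1: %.0f Mpix/s, %.0f Mtex/s\n", px_12x1, texels);
    printf(" 6x2: %.0f Mpix/s, %.0f Mtex/s\n", px_6x2, texels);
    return 0;
}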

The relationship between 8x2 and 16x1 is exactly the same as the one described above for 6x2 and 12x1. So what's actually being said here is that nV40 is either 12x1 or 16x1, but obviously it cannot be both. And whether it is 12x1 or 16x1 will determine whether it is capable of 6x2 or 8x2 when multitexturing; again, obviously, both are not possible in the same gpu. So the actual statement I am assuming Uttar meant to make is that he isn't sure whether nV40 is a 12x1 or a 16x1 pixel pipeline organization, but it necessarily has to be one or the other (I lean to thinking nV40 is 8x1 or 8x2, but that's neither here nor there at the moment...;)).

In other words, it just isn't possible to state that "it doesn't matter" what the pixel pipeline organization for nV40 is, since without knowing that number, which correlates to the physical architecture of the gpu, I don't think it would be possible to rationally discuss any of the performance characteristics of the gpu. Basically, once you have determined what the physical pixel pipeline organization of a gpu is, you can work backwards from there to factor in conditionals, such as shader instructions, trilinear filtering, texels attached per pixel, and so on, to understand what the likely impact of those conditions on performance may be. If you don't know what the fixed pixel pipeline organization is, then it seems to me you cannot figure anything else relative to performance, either...;) (As an aside, this exactly corresponds to the initial frustration I felt in trying to decipher nV30's performance. Once it became clear that the organization was 4x2, instead of 8x1, the picture at last began to make sense.)

*The term "double pumped" escapes me as to how it applies to pixel pipeline organization in a gpu. In DDR ram and cpu fsb's, "double pumped" refers getting data on the rising and falling edges of the clock, instead of on a single edge, with the effect of getting 2x as much data per clock as is possible with a SDR. How is this related to pixel pipelines in a gpu?

I mean, it isn't possible to "double pump" a pixel pipeline and get two pixels per clock out of it, is it? So I've no idea what the term "double pumped" means as a description of pixel pipeline organization in a gpu. As I understand it, you can get an absolute maximum of 1 pixel per clock from each of a gpu's pixel pipelines. Hence, a gpu with 6 pixel pipes could generate a maximum of 6 pixels per clock, but never twelve, since I don't see how that would be physically possible. Eh?

This is an amusing, if confusing, thread...;)
 