NV40: 6x2/12x1/8x2/16x1? Meh. Summary of what I believe

Dave, OT: Are you sure the 5700U is capable of 7600M AA Samples/sec? I believe it has only 8 ROPs, considering it's 2x2/4x1, and the 4 ROPs/output number is for the standard output; not the double-pumped one.


Uttar
 
Pete said:
I thought it had the same vertex power, as both are clocked the same and use the same three vertex shaders...?

Yeah, you're right :oops: I had a bit of a brain fart and thought the FX 5700 Ultra was clocked at 500MHz
 
just some thoughts...

Up till now we have been discussing working on pixel quads due to the (supposedly easy) support of ddx/ddy instructions. Note however that to actually compute dv/dx and dv/dy for a function v(x,y) using a linear approximation, you only need v(x,y) at 3 samples (this gives you a plane equation involving v, x, y).
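A minimal sketch of that plane-fit idea (purely illustrative - `plane_derivatives` is a hypothetical helper, not any actual hardware path): fit the plane v = a*x + b*y + c through the three samples, and a and b are your derivatives.

```python
# Sketch: derive dv/dx and dv/dy from 3 samples via a plane fit.
# v(x, y) ~ a*x + b*y + c  ->  dv/dx = a, dv/dy = b.
# Hypothetical helper; hardware would do this with fixed-function math.

def plane_derivatives(samples):
    """samples: three (x, y, v) tuples, not collinear."""
    (x0, y0, v0), (x1, y1, v1), (x2, y2, v2) = samples
    # Solve the 2x2 system formed by the two edge differences:
    #   (x1-x0)*a + (y1-y0)*b = v1-v0
    #   (x2-x0)*a + (y2-y0)*b = v2-v0
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    a = ((v1 - v0) * (y2 - y0) - (v2 - v0) * (y1 - y0)) / det
    b = ((x1 - x0) * (v2 - v0) - (v1 - v0) * (x2 - x0)) / det
    return a, b  # (dv/dx, dv/dy)

# On an axis-aligned quad this degenerates to simple subtraction:
dvdx, dvdy = plane_derivatives([(0, 0, 1.0), (1, 0, 3.0), (0, 1, 2.0)])
# dvdx = 2.0, dvdy = 1.0
```

The point is that the three samples don't have to sit on an axis-aligned grid, which is what makes this work for sparse sample patterns.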

Why would you do this instead of using a quad, where you could theoretically just use subtractions from neighbouring values?

1. Supersampling on a sparse grid - adjacent samples no longer have identical screen x or y, so simple subtracts are out the window.

2. Multi-sampling : I'm not sure but I was under the impression that when centroid sampling is enabled, the pixel shader is not necessarily running for a fixed (x - xpixel, y - ypixel) (<-- location relative to enclosing pixel).

So using pixel triangles would allow for a better AA implementation, and would have the added benefit of one less pipeline being wasted when running on small triangles. Also, for one-, two-, and three-pixel triangles you can skip triangle setup completely and just pass the VS output for vertices 0, 1, 2 directly to the PS input for samples 0, 1, 2.

Serge
 
Lost: the register problem is mostly due to the length of the pipeline, which needs a lot of in-flight pixels. Pixel pipelines are much deeper than vertex pipelines. That's why the register issue doesn't affect VS performance (or not that much).


Some personal quick thoughts regarding the pipeline architecture of current and future products:

Well first I think that we need to talk about it in two ways: how it really works at the hardware level and how it appears to work at the "result" level. IMHO, the latter is the most important one, while the former is only needed to explain very specific details. I also think that we (press members) have to use the former very carefully, because it's so easy to confuse people with marketing BS (IMHO, saying "NV35 is 4/8 pixel pipelines" is BS) instead of educating them with technical clarifications.


It seems like everyone is forgetting an important detail when talking about the number of pipelines or "double pumped" mode. The more pipelines you have, the more outputs you have -> the wider the memory bus has to be. What I mean is that an 8-pixel output mode (or double-pumped mode) on NV30 would have been useless. You can't compress normal pixels enough for that. However, it would have been useful on NV35, but too difficult without a major core revision.

Regarding double-pumped mode: actually, I don't like to talk about a "double pumped mode". I prefer talking about a 4-pipe design with limitations rather than a 2-pipe design with a double-pumping possibility. The way I see it, NV36 is a GPU with 4 basic pipes and a cheap integration of the complex things (basically: loopback). I don't see it as a 2-pipe GPU with a double-pumped mode. However, I can't see NV30/35 as 8-pipe GPUs. IMO NV30/35 are 4-pipe GPUs with full integration of the complex things and the possibility to work on 8 Z data instead of 4 color data + 4 Z data. We don't need to try to find some justification for the 8-pipe marketing (error/BS) that came with NV30.


A 16x1 mode on next gen could be possible, but not usable without memory running at twice the core frequency. It could also be useless: 8 pipes at 500 MHz give enough max fillrate; 4 pipes don't. An 8-pipe architecture was needed. 16 pipes are not needed at the moment. We now need more power when doing complex things. Doing this with 12 or 16 pipes is a possibility, but not a necessity. The fact that nvidia's next gen will use memory with a higher frequency than the core is not a secret. This leaves the possibility of more than 8 normal pixel outputs, but it doesn't mean that would be the best way to use transistors.
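A rough back-of-envelope version of the fillrate-vs-bus argument (all figures illustrative, not actual next-gen specs): count bytes per output pixel against what a given bus can deliver.

```python
# Back-of-envelope check of the "more pipes need a wider bus" argument.
# All figures are illustrative assumptions, not actual NV4x specs.

pipes = 8
core_mhz = 500
fill_mpix = pipes * core_mhz            # peak fillrate, Mpixels/s

bytes_per_pixel = 4 + 4 + 4             # 32-bit color write + Z read + Z write
need_gbs = fill_mpix * bytes_per_pixel / 1000.0   # GB/s needed at peak

bus_bits = 256
mem_mhz = 500                            # DDR -> 2 transfers per clock
have_gbs = bus_bits / 8 * mem_mhz * 2 / 1000.0

print(need_gbs, have_gbs)   # 48.0 vs 32.0 -> 8 pipes already bandwidth-bound
```

Even with these generous memory numbers the peak uncompressed demand exceeds the bus, which is the point being made: doubling outputs again buys nothing unless the bus scales with it.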


Working on triangles instead of quads is seductive in some ways. But it isn't a win-win idea, and AFAIK it won't happen. IMHO, working on twice the number of quads but over 2 cycles could be an interesting move.
 
The more pipelines you have, the more outputs you have -> the wider the memory bus has to be.

I wouldn't say that's strictly speaking the case. With shader lengths increasing, the average pixels per clock will go down, hence the bandwidth required per pipe won't be as important a metric. You'll probably want to increase the buffer on the output adequately, such that if you do suddenly get large batches of single-texture, bilinear pixels you won't stall the rest of the pipeline too much; but I should imagine the number of clocks per pixel is already quite high, and that will probably increase.
 
DaveBaumann said:
The more pipelines you have, the more outputs you have -> the wider the memory bus has to be.

I wouldn't say that's strictly speaking the case. With shader lengths increasing, the average pixels per clock will go down, hence the bandwidth required per pipe won't be as important a metric. You'll probably want to increase the buffer on the output adequately, such that if you do suddenly get large batches of single-texture, bilinear pixels you won't stall the rest of the pipeline too much; but I should imagine the number of clocks per pixel is already quite high, and that will probably increase.

IMHO, you're right and wrong at the same time. Right for everything you've just said, and wrong... for the same things. If the pipelines are meant to do complex things, you don't need more pipelines but improved pipelines, so the average bandwidth per pipeline doesn't change that much.

Anyway, what I wanted to say is that outputting one pixel per pipeline per cycle needs a well-adjusted memory bus, and that the NV30 memory bus wasn't wide enough for a "double pumped mode". I can't think of a serious (*cough* xgi *cough*) IHV using more pipelines than the number of pixels able to go through the memory bus.
 
Even before we get to any shader lengths, and just sticking with bog standard texturing, R3x0 isn't going to be outputting 8 pixels per clock, because 99% of people are going to be running in trilinear, requiring 2 clocks per pixel with only one texture sampler. I think increasing the number of outputs in line with the number of quads (i.e. 4 outputs per quad) just makes for a flexible design without overcomplicating it. Theoretically, clock for clock, NV35 and R3x0 output similar trilinear filtered pixel or Z/stencil rates, but they have very different methods of going about it.
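Dave's clock-for-clock point can be put in numbers (pipe/TMU counts as commonly reported for these parts; treat them as assumptions):

```python
# Clock-for-clock trilinear throughput for the two layouts discussed.
# Trilinear = 2 bilinear lookups; pipe/TMU counts as commonly reported.

def trilinear_pixels_per_clock(pipes, tmus_per_pipe):
    bilinear_lookups = pipes * tmus_per_pipe   # bilinear lookups per clock
    return bilinear_lookups / 2                # 2 lookups per trilinear pixel

r3x0 = trilinear_pixels_per_clock(pipes=8, tmus_per_pipe=1)  # 8x1 layout
nv35 = trilinear_pixels_per_clock(pipes=4, tmus_per_pipe=2)  # 4x2 layout
print(r3x0, nv35)   # 4.0 4.0 -> same trilinear rate, very different layouts
```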
 
DaveBaumann said:
Even before we get to any shader lengths, and just sticking with bog standard texturing, R3x0 isn't going to be outputting 8 pixels per clock, because 99% of people are going to be running in trilinear, requiring 2 clocks per pixel with only one texture sampler. I think increasing the number of outputs in line with the number of quads (i.e. 4 outputs per quad) just makes for a flexible design without overcomplicating it. Theoretically, clock for clock, NV35 and R3x0 output similar trilinear filtered pixel or Z/stencil rates, but they have very different methods of going about it.

Even even before that, R3x0 can't output 8 pixels/clock in real life because it doesn't have enough memory bandwidth.

I can take a specific case: a game with many sprite trees, like Colin McRae. The sprites are always bilinearly filtered. In this case, R3x0 is a lot faster than NV35, and NV35 isn't really faster than NV36. An NV30 in 8x1 mode wouldn't have helped here, but an NV35 in 8x1 mode would have.
 
Tridam said:
I can take a specific case: a game with many sprite trees, like Colin McRae. The sprites are always bilinearly filtered. In this case, R3x0 is a lot faster than NV35, and NV35 isn't really faster than NV36. An NV30 in 8x1 mode wouldn't have helped here, but an NV35 in 8x1 mode would have.

Somehow I suspect that these cases are not really influencing future hardware design particularly greatly! ;)
 
DaveBaumann said:
Tridam said:
I can take a specific case: a game with many sprite trees, like Colin McRae. The sprites are always bilinearly filtered. In this case, R3x0 is a lot faster than NV35, and NV35 isn't really faster than NV36. An NV30 in 8x1 mode wouldn't have helped here, but an NV35 in 8x1 mode would have.

Somehow I suspect that these cases are not really influencing future hardware design particularly greatly! ;)

Sure but that wasn't the point ;)


My point was that increasing the number of basic pipelines beyond the number of pixels able to go through the memory bus isn't useful -> 8x1 mode on NV30, or Xx1 mode on next gen.

Regarding your trilinear example and the flexibility of an increased number of pipelines. Using twice the number of pipelines because trilinear takes 2 cycles is a bad idea. You'll be able to do the same at a reduced cost by using 2 texturing units per pipeline. What about the flexibility? Well, the flexibility of outputting twice the number of pixels when doing bilinear filtering won't exist if the memory bus isn't wide enough.
 
Tridam said:
Regarding your trilinear example and the flexibility of an increased number of pipelines. Using twice the number of pipelines because trilinear takes 2 cycles is a bad idea. You'll be able to do the same at a reduced cost by using 2 texturing units per pipeline.

And this, more or less, is the situation that R3x0 is in already.

What about the flexibility? Well, the flexibility of outputting twice the number of pixels when doing bilinear filtering won't exist if the memory bus isn't wide enough.

Yes, but what if the next triangle to be operated on has a 100+ instruction shader program? If you have a sufficiently large buffer you can still output the bilinear filtered pixels from the buffer with no overall loss of performance, because you are backed up by these pixels.

The question is to assess where the largest bottlenecks are over a wider range of current and future titles.

[edit] Also, IIRC, the way R300 allocates quads means that it can frequently be operating on different triangles for each quad pipeline, and these may well have different requirements from each other.
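Dave's buffering argument can be sketched as a toy model: a FIFO sits between the pixel pipes and the memory bus, absorbing a fast bilinear burst and draining it while a long shader starves the pipes' output. All numbers here are made-up illustrations, not any real chip's parameters.

```python
# Toy model of an output FIFO between the pixel pipes and the memory bus.
# A bilinear burst overfills the bus rate; a long shader then produces
# nothing, and the bus drains the buffered pixels with no time lost.
from collections import deque

def run(workload, bus_per_clock=4, fifo_cap=64):
    """workload: pixels produced by the pipes on each clock."""
    fifo = deque()
    written = stalled = 0
    for produced in workload:
        if len(fifo) + produced > fifo_cap:
            stalled += 1                      # pipes would stall this clock
            produced = fifo_cap - len(fifo)   # only partial output fits
        fifo.extend([1] * produced)
        for _ in range(min(bus_per_clock, len(fifo))):
            fifo.popleft()                    # bus drains up to its rate
            written += 1
    return written, stalled

# 10 clocks of a bilinear burst (8 px/clk, twice the bus rate), then
# 20 clocks of a long shader (0 px/clk): the FIFO absorbs the burst
# and the bus drains it during the shader, with zero stalls.
print(run([8] * 10 + [0] * 20))   # (80, 0)
```

With a sustained burst longer than the FIFO can absorb, the model starts reporting stalls, which is the "adequately increase the buffer" trade-off in the quoted post.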
 
Consider yourself plugged!

Heh, you rock, Dave. I was pretty sure that the two were the same (no doubt I knew so from my reading here), but I didn't want to take the time to confirm my 3D suspicions. I didn't realize the chip chart had a board brother. Very nice. I'll be pointing more non-technical people to it (the chip chart would have been over the heads of most, thus I usually point to TR's big blue table).

BTW, did you see Uttar's post about your 5700U page's incorrect AA Sample Fillrate? The RV360 seems to be getting way too much credit for its geometry rate, too.
 
Yeah, I'm testing 5700 as I'm not entirely sure what it is at the moment. I've got a standard 5700 (non-Ultra) in the rig for a review at the moment but the results from that are inconclusive due to lack of memory bandwidth - I'll look closer with the 5700U in. RV360 geometry rate was a typo - wrong geometry multiplier in the table - fixed now, thanks.
 
I think they chose 8x1 for the R300 instead of 4x2 because it's a simpler design!
It's two completely separate 4x1 units, which I'm sure is simpler (for example, fewer read/write ports).

It might even be true that this choice didn't increase the transistor count.
 
psurge: Interesting :) But then there's one thing I wonder - how do the NV31/NV34/NV36 calculate DDX/DDY when in 2x2 mode? (Actually, I believe the NV31 was originally meant to be 3x2, but I'm not sure of this, and it's another story completely; the NV34 was always supposed to be a 2x2.)

Could it be possible they're looking at the last 2 pixels, and then operating as if it was a quad?

I'm asking this due to the 2x2 layout of the VS units when in texturing/pixel mode. There are two ways for NVIDIA to implement this IMO:
1) The simple way: 3x2+3x2+2x2
2) The harder, but possibly more elegant way: (3+1)x2+(3+1)x2 = 4x2+4x2

The first way seems to be the better choice, as it would most likely be easier to implement; just sending pixels into the vertex pipeline and such. But if 4x2 is easier for certain things such as DDX/DDY, then perhaps the second would be the smarter choice.

Tridam: When it comes to memory bandwidth issues with pipelines, IMO that's true in most cases, but certain people (Carmack, for example) won't have those problems in their engines.
The first pass only reads/writes the Z buffer. The later passes only read the Z buffer and write colors, IIRC.
In either case, you've got (approximately) half of the normal output bandwidth required, or at least nowhere near the maximum.
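A quick byte-count of that two-pass argument (assuming uncompressed 32-bit Z and 32-bit color; Z compression and fast-Z rejection would pull the Z-only pass even lower):

```python
# Per-pixel framebuffer traffic, in bytes, for the two-pass rendering
# style described above. Assumes uncompressed 32-bit Z and color.

Z, COLOR = 4, 4

single_pass  = Z + Z + COLOR   # classic pass: Z read + Z write + color write
zonly_pass   = Z + Z           # first pass: Z read + Z write only
shading_pass = Z + COLOR       # later passes: Z read (test) + color write

print(zonly_pass / single_pass, shading_pass / single_pass)
```

With these raw numbers each pass needs about two-thirds of a classic pass's output traffic rather than exactly half, but either way it sits well below the peak demand, which is the point being made.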

Instead, I'd talk of the number of ROPs as the limitation; the NV30 could have had a 16x0 mode, but it would have been for 0xAA only. Not very useful, really.

Still, why would NVIDIA want 16x1 then, and not 16x0? Marketing.
Being able to legitimately say you've got a "16 pipelines" design is such an advantage for Joe Consumer it's not even funny, I fear... Although maybe I'm wrong and it's just 16x0. What I'm nearly 100% sure of, though, is that there is a 16 pixels/zixels/whatever output/clock possible in the architecture.


Uttar
 
Being able to legitimately say you've got a "16 pipelines" design is such an advantage for Joe Consumer it's not even funny, I fear... Although maybe I'm wrong and it's just 16x0. What I'm nearly 100% sure of, though, is that there is a 16 pixels/zixels/whatever output/clock possible in the architecture.

You said marketing didn't you?

Watch this one instead:

PowerVR's Z32[TM] technology enables KYRO to process the invisible polygons that affect the stencil buffer exceptionally fast, giving KYRO a significant performance advantage in situations where multiple stencil buffer accesses are required.

"Z32" and to that "trademarked" on an outdated product? ROFL :D

***edit: the additional marketing terminology must have been added only recently to their white papers. Oh and I forgot the source link sorry:

http://www.pvrdev.com/pub/PC/doc/f/Shadows.htm
 
Uttar said:
psurge: Interesting :) But then there's one thing I wonder - how do the NV31/NV34/NV36 calculate DDX/DDY when in 2x2 mode? (Actually, I believe the NV31 was originally meant to be 3x2, but I'm not sure of this, and it's another story completely; the NV34 was always supposed to be a 2x2.)

Could it be possible they're looking at the last 2 pixels, and then operating as if it was a quad?

Even with a 2x2 pixel processor you can work with quads. The only difference is that you need two clocks per quad; in this case the same quad will occupy two stages of the pipeline at once. You can do the same thing with a single-pixel pixel processor - in that case you need four clocks per quad.
 
Dave,
According to the few FX5700Us I've seen so far, they all seem to have a memclk of only 450MHz, not 475 as in your comparison. edit: The RAMDACs on both parts should be 400MHz.
Apart from that, /me thinks he sees all double 'cept for geometry ;)
And that's only logical: nV wants to sell their chips unaltered as Quadros, so they need strong geometry units (or quite a few not-so-strong ones) - produced at a very competitive price in the form of nV36.
Meet the next generation of CAD/CAM-Cards.

For complex rendering as in movie-effects, they still have the more complex nV35/8 with more FPU-Power.

Uttar, AiL:
Regardless of what I have been told so far, here's another speculation on my side (already posted in 3DCenter's nV40 thread):
AFAIK, for VS3.0 compliance the vertex units have to be able to do texture accesses. Again, taking for granted that nV wants to push 3.0 capability for its next generation (for whatever reason - maybe they think ATi's not quite ready for it...), they might want its performance to be just slightly ;) better than in their nV3x line.
Would it make sense for the VS to get a separate TMU for each of its virtual pipelines?
And wouldn't it suffice for that TMU to be bilinear only?

If so, then my best guess is that they tried and made their design capable of using that TMU for casual texture lookups when not using 3.0 shaders, so that in the end they have a chip where each pipeline can, under ideal circumstances, access three bilinear TMUs at once?

That would yield them trilinear filtering almost for free; they'd still be able to get a decent boost off their reduced-trilinear driver mockery of late, or they could simply get an enormous boost to bilinear texfill (given that many texture layers even in recent games use only bilinear filtering, or, as in "Aquanox 2: Revelation" when using in-game AF, detail textures for example are only bilinear).

I am not a GPU designer nor a programmer, so I could be completely wrong on all accounts, but it sounds quite logical to me.
 
Tridam said:
Even even before that, R3x0 can't output 8 pixels/clock in real life because it doesn't have enough memory bandwidth.

I can take a specific case: a game with many sprite trees, like Colin McRae.
Note that alpha test forces Z features off, which makes chips take a big hit too.
 