I just wrote a pipeline article

Arun

Hey everyone,

I just finished writing a small pipeline-related article because I was quite annoyed by all the misconceptions I see ( not so much around here, but a lot more in less techy forums ).

As I say in the article, while I believe there should be a minimal number of mistakes ( besides the NV30 speculation part, where I believe there is likely a ton of them ), I'd really appreciate it if people would show me my errors so I could correct them and hopefully improve the article.

Anyway, here's the link: http://www.notforidiots.com/pipelines.php

The current version is quite hard to understand for "non-adepts", so I'll try to write a "newbie" version in the near future.


Uttar
 
A pipeline is a group of units, used in a specific order, and their associated cache, all of which works on one, and only one, pixel.

If this were the case, it would mean that a pixel would have to traverse the entire pipeline before the next one could start, which would take x clocks. Unless the increase in clock speed due to pipelining were greater than linear, things would end up running slower.

I think you need to revise it, such that many pixels can be rendered at once, but they're all in various stages of the pipeline. So defining a stage would be useful.
 
Perhaps you could say a pipeline outputs one and only one pixel, not works on one and only one pixel.
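To put rough numbers on that objection, here is a minimal sketch comparing the two readings of "pipeline". It assumes an idealized pipe where every stage takes exactly one clock and nothing ever stalls; the function names and the stage count of 10 are made up for illustration and don't describe any real chip.

```python
def clocks_one_pixel_at_a_time(num_pixels, num_stages):
    # Reading 1: the whole pipeline works on one and only one pixel,
    # so each pixel must leave before the next one may enter.
    return num_pixels * num_stages


def clocks_overlapped(num_pixels, num_stages):
    # Reading 2: many pixels are in flight, each in a different stage.
    # The first pixel needs num_stages clocks to come out the far end;
    # after that, one more pixel completes on every clock.
    if num_pixels == 0:
        return 0
    return num_stages + (num_pixels - 1)


stages = 10  # hypothetical stage count
for pixels in (1, 100, 10_000):
    serial = clocks_one_pixel_at_a_time(pixels, stages)
    overlapped = clocks_overlapped(pixels, stages)
    print(f"{pixels} pixels: {serial} clocks serial, {overlapped} clocks overlapped")
```

With 10 stages and 100 pixels that is 1000 clocks versus 109, which is why "outputs one pixel per clock" describes the hardware better than "works on one pixel".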
 
Hmm, yes, I think that's not very precise now that you mention it...
Although I'm really just talking about *pixel* pipelines. Maybe I should have made that clearer.

And with loopback, does a pixel pipeline really work on multiple pixels at once? I kinda fail to see how...
 
No, it works on the same pixel on different layers of that pixel (i.e. the texture addressing stage on an R300 could be generating the address for the second texture layer while the texture sampling stage is working on the sample for the first layer, and so on).
 
Ah, I see, interesting. Hmm, but then, that just means my definition is not sufficiently precise, not incorrect. Don't know exactly how to correct that, though...
Going to bed now anyway, so I'll correct it sometime tomorrow.


Uttar
 
Hey Uttar, that was Uttarly fabulous (I couldn't resist).

A suggestion for the newbie article if you ever get time... diagrams preferably pastel coloured. ;)
 
Hmm....I still like my proxels (goshdurnit! ;) ), as it is a compromise between complete information (did you ever finish reading it, Uttar? You state your dissatisfaction was based on not reading the links all the way through) and ease of understanding. Maximum, Standardized, and Minimum "proxel fillrates" do greatly facilitate expressing most of the concepts you cover in a useful and accessible way (a part of my discussion on proxels you may not have read)...what is missing is precision, which can easily be discussed within that context.
Tex ops are missing too, but how do you express the full impact of that (without a long explanation) beyond what you can already do as facilitated by the Min, Max, and Std contexts? It provides 3 opportunities to illustrate issues that limit applicability, and the relevance of texture ops can be expressed more understandably within that (IMO).

I think I've already offered most of my take on the issues you are addressing (links available in the thread of yours you mention in your article, which you might want to "linkify" at some point), if that's what you're looking for.
 
links available in the thread of yours you mention in your article, which you might want to "linkify" at some point

You mean hyperlink... hang on, isn't that someone's supposed IP? :p
 
I'm not sure it's a good idea to redefine the word 'pipeline' - which is what's been done here... this makes the statement 'the nv3x may not have pipelines' a bit strange: since what is talked about is some abstract concept of a pipeline anyway, the nv3x still has these 'abstract pipelines' - it may not have pipelines in the conventional sense.

Of course, the nv3x is pipelined. Just because in some stages of the pipe you may be travelling through the same hardware unit you travelled through before, or that a particular pipe stage outputs to some buffer from which one of a pool of units picks up the result doesn't make it an architecture that isn't pipelined. For example, consider the Pentium 2 (plenty of architectural details on that) - uops go into the reorder buffer to be picked out by the scheduler and passed to any free execution unit. This might be similar to the nv3x (or it might not - I have no idea at all) but it's still just a pipeline.

In pretty much every case for 3D graphics you'll probably see the same kind of effect: the performance of a pipeline is limited by its slowest part, but some bottlenecks can be alleviated by the introduction of load-balancing or latency-compensation FIFOs. So, what you need to consider when looking at an architecture's performance is 'what are the bottlenecks'?
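As a rough illustration of that kind of bottleneck analysis, here is a minimal sketch. The stage names and per-clock rates are invented for the example and are not measurements of any real part; it also assumes the stages are already decoupled well enough that only sustained rates matter.

```python
def sustained_throughput(stage_rates):
    # A chain of stages can only sustain the rate of its slowest stage.
    return min(stage_rates.values())


def find_bottleneck(stage_rates):
    # Name of the stage with the lowest rate.
    return min(stage_rates, key=stage_rates.get)


# Hypothetical rates, in items per clock, for a made-up pixel path.
rates = {
    "triangle_setup": 8.0,
    "texture_sample": 4.0,
    "shader_alu": 2.0,
    "color_output": 4.0,
}

print("bottleneck:", find_bottleneck(rates))
print("sustained items/clock:", sustained_throughput(rates))
```

Changing the workload changes the numbers in the table, and with them the limiting stage, so the same hardware can look "wide" on one workload and "narrow" on another.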
 
Eh, I never expected people not to immediately realize I'm limiting myself to pixel pipelines...
I'll really, really have to make this clearer once I get back from school ( in about 10 hours )

Also, Dio, I don't think this is similar to the Pentium 2 at all. I think the NV3x is pipelined on many levels, such as:
- Global: VS -> Triangles -> PS -> Output
- Micro: VS & PS units likely have mini-stages in order to be faster & more efficient

However, I'm wondering right now if the NV3x really has another form of pipelining, namely task-specific pipelining. I actually don't think it has:
FP->FX->FX - because I'm pretty sure it could easily do FP->FX->FX in one clock cycle too. Actually, IIRC, thecpkrl ( sp? ) said he thought the compiler smartly reallocated instructions. But someone is going to have to explain to me how to do that for dependent instructions! :D

But yes, the NV3x is still very much pipelined in several senses of the term for sure ( and maybe I'm confused and it really is in all senses of the term ) - going to have to fix that.

Uttar
 
OpenGL guy said:
Dio said:
So, what you need to consider for looking at an architecture's performance is 'what are the bottlenecks'?
Are we talking ale or lager bottles? ;)
Have I ever been picky about which it is? It's what it tastes like that counts!
 
Uttar said:
But yes, the NV3x is still very much pipelined in several senses of the term for sure ( and maybe I'm confused and it really is in all senses of the term ) - going to have to fix that.
I found this, might help a bit. It's a definition of a pipelined CPU architecture.

http://216.239.37.100/search?q=cach...finition+pipeline+architecture&hl=en&ie=UTF-8

I would say that applies pretty much to 3D as well - we use the terms pipe stage, process cycle, throughput, etc.

I think the problem is you're trying to reconcile the terms that have been latched onto by the mainstream with what's really going on inside the chip. As such, I'd stick with the real meaning of 'pipeline' and phrase it like 'This is what would traditionally have been defined as an 8-pipe architecture, in the days where one pipeline processed one pixel'. Of course, the more-modern definition of that has also changed slightly: now it is probably 'one pixel per clock' not 'one pixel'.

Ain't semantics great :)
 
Okay, so I did a few corrections.
First, I changed the definition of a pipeline, partly based on the paper Dio linked ( thanks Dio! )
Also, I noted that the NV30 still is pipelined, and I modified the idea of it not having pipelines to "having one uber pixel pipeline".

Is this more accurate? Any other required corrections?


Uttar
 
I think it's still pretty clear that NV30 has four pipelines - just from the very fact that some of the operations take place (dx, dy for instance). There may be cases where those four pipelines combine operations to only produce one pixel (as GF3/4 did with the register combiners) but it's still basically 4 pipelines.
 
Dave: Well, the NV30 *obviously* has the equivalent of four "traditional pipelines".
What I'm trying to say here is that the NV30 is really one pipeline, with many functional calculation units and color/Z output units.
As David Kirk said, it's a processor - not a blender. And as such, I believe that the NV30 actually has *one* pixel pipeline, with one pool of registers, and that it can actually work on multiple pixels at once in *one* pipeline.

Now, obviously, the question that arises is... why?
Well, we obviously don't know for sure what the original design goals were. We know that features were cut due to lack of resources and time. Maybe the original goal was to have branching in the PS too - who knows. I'm sure there are other advantages to such a technique, such as a potentially more efficient use of calculation units ( you can reorder whichever way you see fit ).

I believe the reason it operates so much like a four pipeline architecture is that this is actually the most efficient way to use the output units.
You see, that means the NV30 could actually operate on 16 pixels, or maybe even more ( David Kirk mentioned 32 I think ) at once - but then register usage would be a truly dramatic problem and the color output units would first be unused, and then suddenly completely saturated.

Remember those "The NV30 can actually operate like a 4 or 8 pipelines architecture" rumors which actually originally arrived even before the NV30's launch IIRC? I actually believe they are justified, in that it would emulate that number of pipelines, but it just makes no sense for nVidia to use an 8-pipeline design when it can't output 8 pixels - heck, they don't have 8 color output units! ( and that's because with the original NV30 design, you only had a 128-bit bus, which wouldn't be capable of outputting 8 32-bit values/clock anyway - I guess they were too lazy to change it in the NV35 )
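The bus-width argument above is plain arithmetic, so here it is as a small sketch. It takes the post's figures at face value (128-bit bus, 32-bit color values) and assumes one bus transfer per clock; the `transfers_per_clock` parameter is a hypothetical knob added for illustration, and memory that transfers more than once per clock, or any compression or caching, would change the result.

```python
def max_values_per_clock(bus_width_bits, value_bits, transfers_per_clock=1):
    # Upper bound on whole values the bus can carry in one clock.
    # transfers_per_clock=1 is an assumption, not a statement about NV30.
    return (bus_width_bits * transfers_per_clock) // value_bits


bus_bits = 128    # bus width quoted in the post
color_bits = 32   # one 32-bit color value per pixel, as in the post

per_clock = max_values_per_clock(bus_bits, color_bits)
print(f"{per_clock} x {color_bits}-bit values per clock on a {bus_bits}-bit bus")
print("room for 8 per clock under this assumption:", per_clock >= 8)
```

Under these assumptions the bus carries 4 values per clock, which matches the four color outputs discussed in the thread; it says nothing about what the chip can keep in flight internally.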

Also, that means maybe the register usage problem is more important on the NV30 when running Z/Stencil-only passes, since it then works on 8 pixels. Would be a very interesting test IMO.


Uttar
 
The article definitely feels better. I'd maybe say the definition of pipelining is 'where concurrent operations or parts of operations are overlapped in time during execution' but I'm just nitpicking now :)

It's an interesting point you raise as to whether hardware with one 'physical pipeline' but processing four pixels simultaneously down that pipeline has 'four pipelines'. I don't know the answer to that one!

However, reading what you just posted I'm very confused :)

I still think a bottleneck-based-analysis might prove more fruitful and enlightening - it would clarify the 'sometimes it has 8 pipelines and sometimes it has 4' stuff, for example.
 
I think you are distorting the meaning of pipeline again. A pipeline seems to be the coherency from input to output within a fixed limitation of execution.

For color output (i.e., "pixels"), that is limited by color output capability, or else you could multiply by stages for other pipelined architectures as well.

For other "elements", as nVidia has made that separate in this architecture, it is limited by the output capabilities for those elements.

Your count of 16 pixels it "could" work on could easily be true for, for example, the R300 as well. As long as you're actually talking about a pipeline, though, it cannot.

Not that we know how NV35 stacks up in this regard, but not changing it in the NV35 wouldn't have to be "laziness", it could just be a limitation of the design. It doesn't work on 8 pixels, it works on discrete coherency concepts that it can output 8 at a time when they are just stencil or Z (or maybe color, too, and therefore pixels, with the NV35...we still have to test). So could the R300 (work on 16 "concepts"), but there is no point calling it 16 when it can only output 8.
 