NV4x/G7x shader pipelines architecture

nAo

Increased scalability in the fragment shading pipeline

A fragment processor includes a fragment shader distributor, a fragment shader collector, and a plurality of fragment shader pipelines. Each fragment shader pipeline executes a fragment shader program on a segment of fragments. The plurality of fragment shader pipelines operate in parallel, executing the same or different fragment shader programs. The fragment shader distributor receives a stream of fragments from a rasterization unit and dispatches a portion of the stream of fragments to a selected fragment shader pipeline until the capacity of the selected fragment shader pipeline is reached. The fragment shader distributor then selects another fragment shader pipeline. The capacity of each of the fragment shader pipelines is limited by several different resources. As the fragment shader distributor dispatches fragments, it tracks the remaining available resources of the selected fragment shader pipeline. A fragment shader collector retrieves processed fragments from the plurality of fragment shader pipelines.
 
So, that looks like G80's robust scheduler re ps anyway. . . It seems to me we usually see these patents a few months before the actual hardware embodying them (at least in those cases where we've been able to identify that they've been productized at all).

The meat:

[0008] In an embodiment of the invention, a fragment processing unit includes a fragment shader distributor, a fragment shader collector, and a plurality of fragment shader pipelines. Each fragment shader pipeline is adapted to execute a fragment shader program on a segment of fragments. The plurality of fragment shader pipelines operate in parallel, executing the same or different fragment shader programs. The fragment shader distributor receives a stream of fragments from a rasterization unit. The fragment shader distributor dispatches a portion of the stream of fragments to a selected fragment shader pipeline until the capacity of the selected fragment shader pipeline is reached or until no more fragments arrive within a preset duration. The fragment shader distributor then selects another fragment shader pipeline. The portion of the stream of fragments that is sent to the selected fragment shader pipeline is called a fragment stream segment. The capacity of each of the fragment shader pipelines is limited by several different resources. As the fragment shader distributor dispatches fragments, it tracks the remaining available resources of the selected fragment shader pipeline. A fragment shader collector retrieves processed fragments from the plurality of fragment shader pipelines. The fragment shader collector follows the same selection order as the fragment shader distributor to maintain the order of the stream of fragments.

[0009] In an embodiment, a graphics processing subsystem including a fragment processor adapted to determine at least one value for each fragment of a stream of fragments. The fragment processor comprises a first fragment shader pipeline adapted to execute at least a portion of a fragment shader program on a segment of fragments. The fragment processor also includes a fragment shader distributor. The fragment shader distributor is adapted to receive a stream of fragments, to select the first fragment shader pipeline to execute a first portion of the stream of fragments, and, for each received fragment of the stream, to determine if the received fragment fits within the segment of fragments of the selected fragment shader pipeline, and to dispatch the received fragment to the selected fragment shader pipeline in response to a determination that the received fragment does fit within the segment of fragments of the selected fragment shader pipeline. A fragment shader collector is adapted to select the first fragment shader pipeline and to retrieve each fragment in the segment of fragments from the fragment shader pipeline selected by the fragment shader collector in response to a signal indicating that the fragment shader pipeline selected by the fragment shader collector has completed the execution of the fragment shader program on the segment of fragments.

[0010] In a further embodiment, in being adapted to determine if the received fragment fits within the segment of fragments of the first fragment shader pipeline, the fragment shader distributor is adapted to determine some resource requirements of the received fragment, to determine a measurement of available resources of the selected fragment shader pipeline, and to generate a signal indicating that the received fragment fits within the segment of fragments of the selected fragment shader pipeline in response to a determination that the resource requirements of the received fragment do not exceed the measurement of the available resources of the selected fragment shader pipeline.

[0011] In another embodiment, the graphics processing subsystem includes a second fragment shader pipeline adapted to execute at least a portion of a fragment shader program on a segment of fragments. In response to a determination that the received fragment does not fit within the segment of fragments of the selected fragment shader pipeline, the fragment shader distributor is adapted to select the second fragment shader pipeline and to dispatch the received fragment to the selected fragment shader pipeline. In an additional embodiment, the fragment shader collector is adapted to receive a signal indicating the selection of the second fragment shader pipeline. In response to the signal, the fragment shader collector is adapted to select the second fragment shader pipeline. The fragment shader collector is further adapted to retrieve each fragment in the segment of fragments from the fragment shader pipeline selected by the fragment shader collector in response to a signal indicating that the fragment shader pipeline selected by the fragment shader collector has completed the execution of the fragment shader program on the segment of fragments.
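
For anyone who would rather read code than claim language, here is a rough Python sketch of how I read the distributor/collector scheme in [0008] to [0011]. It is only an illustration under my own assumptions, not how the hardware does it: the names `distribute` and `collect` are made up, a single scalar `cost` stands in for the "several different resources", pipelines are picked round-robin (the patent only says "selects another"), a pipeline is assumed to be drained before it is selected again, and the "preset duration" timeout is left out.

```python
def distribute(fragments, num_pipelines, capacity):
    """Split a fragment stream into segments, one selected pipeline at a time.

    fragments: iterable of (fragment_id, cost) pairs. `cost` is a hypothetical
    stand-in for the resources a fragment needs.
    Returns a list of (pipeline_index, segment) in selection order.
    """
    order = []
    selected = 0
    segment, free = [], capacity
    for frag_id, cost in fragments:
        if cost > capacity:
            raise ValueError(f"fragment {frag_id} can never fit in a segment")
        if cost > free:
            # Fragment does not fit: close this segment, select another pipeline.
            order.append((selected, segment))
            selected = (selected + 1) % num_pipelines  # assumed round-robin
            segment, free = [], capacity
        segment.append(frag_id)
        free -= cost  # track the remaining resources of the selected pipeline
    if segment:
        order.append((selected, segment))
    return order


def collect(order, shade):
    """Retrieve shaded fragments following the distributor's selection order."""
    out = []
    for _pipeline, segment in order:
        out.extend(shade(frag_id) for frag_id in segment)
    return out


stream = [(0, 2), (1, 2), (2, 3), (3, 1), (4, 4)]
order = distribute(stream, num_pipelines=2, capacity=4)
print(order)
print(collect(order, shade=lambda f: f * 10))
```

Because the collector walks the same selection order as the distributor, the output stream keeps the order of the input stream even though the segments could be shaded in parallel.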
 
nAo said:
NV4x/G7x shader pipelines architecture
geo said:
So, that looks like G80's robust scheduler

:?:

Only one of these two persons is right!

Who would that be? For the answer, stay tuned until the next episode!
 
Vysez said:
:?:
Only one of these two persons is right!

Who would that be? For the answer, stay tuned until the next episode!
I'm right :p
But I could be wrong..;)
That patent describes a part that imho completely matches the information we have about G7x (in fact I believe NV40 was more primitive than that..)
 
Vysez said:
:?:

Only one of these two persons is right!

Who would that be? For the answer, stay tuned until the next episode!

Well, who are you going to believe, Vy? nAo who is a dev, has access to NDA info, and an extensive network of contacts in the industry? Or me, who has the Inq?

I mean, c'mon.

<well, okay, I do have B3D forums too, but then so does he>
 
so is nvidia abandoning the long pipeline approach with the G80?
and doing something similar to the R5XX 'ultra-threading' by using arrays of processing units, which work on very small threads, instead of pushing a huge amount of pixels into a very long pipeline? or does their 'scheduler' work within the fragment pipe and just help to process the thread more efficiently?
 
DOGMA1138 said:
so is nvidia abandoning the long pipeline approach with the G80?
I don't think you can deduce this from that patent; imho if you read it, it's pointing in exactly the opposite direction, which is why I believe it's related to NV4x/G7x and not to G80
 
I've been wanting to ask this question for a while. If you want to turn the latency problem into a bandwidth problem, fragment state is going to be sitting around for tens to hundreds of cycles while other things happen, whether it's a g7x/g8x/r5xx/r6xx and whatever the size of a "batch" is. Smaller batches allow more freedom in deciding what to do next (better overlap of IO and computation), but a cache miss is still going to cost you many cycles (of latency).

So... what exactly are people talking about when they say a "long" or "short" GPU pipeline? (I cannot imagine an implementation that has hundreds of HW stages separated by latches, at least when considering the pixel shader subsystem by itself).

Edit: to clarify what I mean. There are quite detailed pipeline diagrams available for CPUs, and they usually have a couple of stages for instruction decode, a couple for rename/schedule, a couple for execute (and so on). So when people say GPUs have a "long" pipeline, does that mean that there are many more such stages (e.g. in a GPU's pixel shader unit), or simply that a pixel is "in flight" many cycles longer than, for example, a CPU "instruction"?
 
nAo said:
I don't think you can deduce this from that patent; imho if you read it, it's pointing in exactly the opposite direction, which is why I believe it's related to NV4x/G7x and not to G80

Or even the entire DX9.0 family of GPUs.
 
psurge said:
I've been wanting to ask this question for a while. If you want to turn the latency problem into a bandwidth problem, fragment state is going to be sitting around for tens to hundreds of cycles while other things happen, whether it's a g7x/g8x/r5xx/r6xx and whatever the size of a "batch" is. Smaller batches allow more freedom in deciding what to do next (better overlap of IO and computation), but a cache miss is still going to cost you many cycles (of latency).

So... what exactly are people talking about when they say a "long" or "short" GPU pipeline? (I cannot imagine an implementation that has hundreds of HW stages separated by latches, at least when considering the pixel shader subsystem by itself).

Edit: to clarify what I mean. There are quite detailed pipeline diagrams available for CPUs, and they usually have a couple of stages for instruction decode, a couple for rename/schedule, a couple for execute (and so on). So when people say GPUs have a "long" pipeline, does that mean that there are many more such stages (e.g. in a GPU's pixel shader unit), or simply that a pixel is "in flight" many cycles longer than, for example, a CPU "instruction"?

Classical GPU designs use the pipeline itself as storage and latency compensation. Additionally, they have the memory controller for the textures as part of the pipeline. As you can see in this old Rampage design, the whole pipeline has many stages, most of them in the memory part. Modern GPUs with real floating-point calculation and higher clocks have even more stages.
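
To put rough numbers on "pipeline as storage", here is a back-of-the-envelope sketch. It only applies the usual rule that items in flight equal latency times throughput; the function names and the example figures (200 cycles of fetch latency, one quad per clock, four vec4 fp32 registers per fragment) are made up for illustration and are not taken from any real chip.

```python
def fragments_in_flight(latency_cycles, fragments_per_cycle):
    """Fragments that must be resident to cover a fetch of the given latency."""
    return latency_cycles * fragments_per_cycle


def state_bytes(in_flight, registers_per_fragment, bytes_per_register=16):
    """Storage for that many fragments (one vec4 fp32 register is 16 bytes)."""
    return in_flight * registers_per_fragment * bytes_per_register


# Hypothetical figures, chosen only to show the arithmetic.
n = fragments_in_flight(latency_cycles=200, fragments_per_cycle=4)
print(n, "fragments in flight")
print(state_bytes(n, registers_per_fragment=4), "bytes of fragment state")
```

Whether that state lives in hundreds of latched stages or in a register file feeding a short ALU pipe, the amount of it is set by the latency being hidden, so "long pipeline" in this sense is mostly about how many fragments are in flight.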

 
DOGMA1138 said:
so is nvidia abandoning the long pipeline approach with the G80?
and doing something similar to the R5XX 'ultra-threading' by using arrays of processing units, which work on very small threads, instead of pushing a huge amount of pixels into a very long pipeline? or does their 'scheduler' work within the fragment pipe and just help to process the thread more efficiently?

Does it say anything about the thread size in there? (yes, I'm a lazy bastard :))
 
DOGMA1138 said:
so is nvidia abandoning the long pipeline approach with the G80?
and doing something similar to the R5XX 'ultra-threading' by using arrays of processing units, which work on very small threads, instead of pushing a huge amount of pixels into a very long pipeline? or does their 'scheduler' work within the fragment pipe and just help to process the thread more efficiently?

What leads you to believe that? I don't see anything in the patent that implies that. I actually don't see anything that disqualifies it from being an NV4x thing.
 