Unified Shaders: With traditional pipes?

I agree with Rys. NVIDIA will switch to unified shading at some point in the future.
It's not that they aren't able to design a unified shading GPU; they'll make the switch when they think the time has come.
I mean, they're talking about unified shading in a two-year-old patent ;)

ACROSS-THREAD OUT OF ORDER INSTRUCTION DISPATCH IN A MULTITHREADED MICROPROCESSOR
the plurality of threads includes a first group of threads having a first thread type and a second group of threads having a second thread type, and wherein the selection logic circuit is further configured to select one of the plurality of threads based at least in part on respective thread types of each of the plurality of threads.
wherein the selection logic circuit is further configured to select a first candidate thread having the first thread type and a second candidate thread having the second thread type, and to select between the first candidate thread and the second candidate thread based on the respective thread types.

For example, if the thread types correspond to pixel threads and vertex threads, it may be desirable to give priority to vertex threads (e.g., because some pixel threads might not be able to be initiated until processing of a relevant vertex thread has been completed). Thus, one selection rule might always choose a vertex thread over a pixel thread.
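
The selection rule in that excerpt can be sketched in a few lines of Python. This is only an illustration: the patent describes a hardware selection circuit, and the names `Thread`, `select_thread` and the `ready` flag are hypothetical, not taken from it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Thread:
    tid: int
    kind: str    # "vertex" or "pixel"
    ready: bool  # hypothetical flag: the thread can issue this cycle


def select_thread(threads: Sequence[Thread]) -> Optional[Thread]:
    """Pick one candidate per thread type, then prefer the vertex candidate."""
    vertex = next((t for t in threads if t.ready and t.kind == "vertex"), None)
    pixel = next((t for t in threads if t.ready and t.kind == "pixel"), None)
    # Rule from the example: always choose a vertex thread over a pixel thread.
    return vertex if vertex is not None else pixel


threads = [
    Thread(0, "pixel", True),
    Thread(1, "vertex", False),  # not ready, so it cannot be a candidate
    Thread(2, "vertex", True),
]
chosen = select_thread(threads)
```

Here the ready vertex thread wins even though a pixel thread appears first in the list; a pixel thread is only chosen when no vertex thread is ready.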

ciao,
Marco
 
Rys said:
Yes, the vertex unit is simpler, but right now that's only because it has less ability than a fragment unit. Unify the caps of vertex and fragment hardware and (pretty much) you're just building the same unit in silicon.
"Less Ability"? Vertex shaders aren't "inferior" to pixel shader units, they're just different, even on NV4x. Pixel shaders are good at hiding texture latency, whereas vertex shaders are good at branching. Having both together is really expensive: see R520 vs G70 transistor counts, and R520 PS's branching isn't as good as NV4x's VS branching on non-coherent loads!

geo said:
And yet who do we see with the really robust scheduler in hardware so far?
Don't confuse lackluster branching performance with anemic instruction scheduling. G70's scheduler is pretty darn good at scheduling the pipeline it has to work with.
 
Will USA scale better?


Using the C1 as an example: what is the relationship, in number of transistors, between the portions of the chip devoted to scheduling and control logic and those devoted to the ALU arrays and texture units, and how do they scale? Does this give it an advantage over non-USA designs, in that the transistor count will not increase as rapidly?
 
Thanks for coming back, Sigma :) Thanks for your perspective on the software front. I really hadn't taken what you mentioned into account. Thanks.
 
Bob said:
"Less Ability"? Vertex shaders aren't "inferior" to pixel shader units, they're just different, even on NV4x. Pixel shaders are good at hiding texture latency, whereas vertex shaders are good at branching. Having both together is really expensive: see R520 vs G70 transistor counts, and R520 PS's branching isn't as good as NV4x's VS branching on non-coherent loads!

Talking SM3.0 across the board here:

I said simpler, not 'inferior' (your words ;) ) and yes, there's less general processing ability in the vertex hardware (which is what I was talking about but maybe didn't make that clear) in terms of its instruction set compared to the fragment unit.

I won't argue with your assessment of their relative merits in terms of branching and latency hiding from texture sampling, but I maintain the VS silicon (in NV hardware at least) is simpler in a processing of data sense. It can do less than a fragment unit and I maintain (happy to be corrected) that means less silicon and transistor budget.
 
zeckensack said:
The real reason you want to do it is because it achieves pretty much perfect load balancing. That's not to say that I think it's easy to pull this off. But once you do, this is the benefit you're going to get. It's all about performance, not about features or the programming model.
I totally agree with you, but there is definitely a connection to features, namely vertex texturing and dynamic branching.

Current games are not very vertex heavy, so IHVs can get away with fewer vertex units. Add the fact that their operations are pure arithmetic and they don't need much latency hiding, and the result is compact units. I don't think there is that much space to save by eliminating the VS, so generally speaking, I think the gains of a USA in today's games will not be very big.

It's tough to say, because I know current architectures have a big FIFO between the projection unit and PS to minimize the impact of clusters of small/zero-pixel triangles, and this can be nearly eliminated with good load balancing. On the other hand, the additional data routing and scheduling needed is big, so that would likely take up plenty of space.

Throw vertex texturing into the picture, and suddenly your vertex units must be a lot bigger if they're going to perform well. Good dynamic branching needs a good scheduler, so you probably don't need a whole lot more to throw vertex management into the mix. Now USA makes a lot more sense, since the die savings are bigger, and thus you can add more ALUs like Xenos did.

So yes, it is all about performance, but performance depends on the load, which depends on the features used, which depend on the shader model and how developers use it. So in the end, it's not really all about the performance.
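
To put rough numbers on the load-balancing argument, here is a toy model in Python. It assumes every unit retires one unit of work per cycle, that vertex and pixel work are perfectly divisible, and that unification costs nothing in scheduling or routing, which is exactly the part that is hard in practice. The function names, unit counts and workloads are all made up for illustration.

```python
def cycles_split(vertex_work: float, pixel_work: float,
                 vs_units: int, ps_units: int) -> float:
    """Dedicated units: the frame takes as long as the busier pool."""
    return max(vertex_work / vs_units, pixel_work / ps_units)


def cycles_unified(vertex_work: float, pixel_work: float, units: int) -> float:
    """Unified units: any unit takes any work, so the load is shared evenly."""
    return (vertex_work + pixel_work) / units


# Hypothetical workloads in arbitrary work units: (vertex_work, pixel_work).
todays_game = (640.0, 2400.0)   # pixel heavy, close to the 8:24 unit split
vertex_heavy = (1600.0, 480.0)  # e.g. heavy use of vertex texturing

for name, (v, p) in (("today's game", todays_game), ("vertex heavy", vertex_heavy)):
    print(name, cycles_split(v, p, 8, 24), cycles_unified(v, p, 32))
```

When the load already matches the fixed split, unifying gains little; when it doesn't, the dedicated design is stuck waiting on its smaller pool.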

BTW, many of the explanations above are not directed at you, since you're probably more knowledgeable than I am. They're for other members' benefit.
 
Rys said:
I said simpler, not 'inferior' (your words ) and yes, there's less general processing ability in the vertex hardware (which is what I was talking about but maybe didn't make that clear) in terms of its instruction set compared to the fragment unit.
What can the PS do that the VS can't, outside of texturing? Most of the differences between what the PS and VS do are things that don't really fit on the other. What does it mean to kill a vertex? What does it mean to compute ddx/ddy on vertices, when there is no connectivity information?
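
To make the ddx/ddy point concrete: pixel hardware can approximate screen-space derivatives by differencing a value between neighbouring pixels of a 2x2 quad, which only works because each pixel has known neighbours. A minimal Python sketch, assuming a quad laid out as top row then bottom row; the function name is hypothetical and real hardware differs in the details.

```python
from typing import List, Tuple

Quad = List[List[float]]  # [[top_left, top_right], [bottom_left, bottom_right]]


def quad_derivatives(values: Quad) -> Tuple[float, float]:
    """Coarse finite differences across a 2x2 pixel quad: (d/dx, d/dy)."""
    ddx = values[0][1] - values[0][0]  # right neighbour minus left
    ddy = values[1][0] - values[0][0]  # lower neighbour minus upper
    return ddx, ddy


# A texture coordinate u = 0.25*x + 0.5*y evaluated at the four pixels of a quad.
u = [[0.25 * x + 0.5 * y for x in (0, 1)] for y in (0, 1)]
print(quad_derivatives(u))
```

A lone vertex has no such neighbour to difference against, which is why the operation has no obvious meaning in a vertex shader.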
 
scificube said:
In contrast the setup engine feeds vertex pipes directly and quickly so that latency issues are greatly reduced.
It's actually the other way around. Vertex processing happens before setup and rasterization.

Possible reasons to keep vertex and pixel shading separate might be MIMD vs. SIMD and Vec4 + scalar vs. Vec3 + scalar. In other words Nvidia currently has MIMD vertex shaders and SIMD pixel shaders.

Look to 3dlabs for another possible reason to avoid hardware unification. IIRC, their vertex shaders are 36 bit and the pixel shaders are 32 bit.

I'm sure there are many more potential reasons.
 
Bob said:
What can the PS do that the VS can't, outside of texturing? Most of the differences between what the PS and VS do are things that don't really fit on the other. What does it mean to kill a vertex? What does it mean to compute ddx/ddy on vertices, when there is no connectivity information?
I hung myself when I explicitly mentioned instruction set, since the differences in SM3.0 VS and PS are fairly small if you discount texturing instructions and instructions like dsx/dsy (dp2add is pretty much it).

And if I mention instruction slots (544/4096) I'll get shot down with "people don't run long shaders", and the flow control caps of PS and VS in recent NVIDIA hardware are equivalent.

I had a notion about unit complexity that stemmed back to instructions NV30 supported natively without macros, which I should probably throw out of the window now. Point taken :D
 
Demirug said:
I would not call 16-pixel processing MIMD. It is less SIMD than NVIDIA's, but still SIMD.
All of R300+'s quads work on completely separate regions (tiles), hence they are unrelated in terms of the commands they are working on - a single quad is SIMD, yes (as with any graphics processor), but across the quads it's MIMD.
 
Dave Baumann said:
All of R300+'s quads work on completely separate regions (tiles), hence they are unrelated in terms of the commands they are working on - a single quad is SIMD, yes (as with any graphics processor), but across the quads it's MIMD.
You can say the same about NV40/G70 too: a single batch is SIMD, but it's MIMD across batches (even if batches don't correspond to tiles or particular frame buffer areas).
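
A small Python sketch of the distinction being argued here, purely as an illustration: every lane in a batch executes the same instruction (SIMD), while separate batches carry their own program and program counter (the "MIMD across batches" view). The `Batch` class and the toy programs are hypothetical and say nothing about how NV40, G70 or R300 are actually built.

```python
from dataclasses import dataclass
from typing import Callable, List

Program = List[Callable[[float], float]]  # one function per instruction


@dataclass
class Batch:
    program: Program
    lanes: List[float]  # one value per pixel in the batch
    pc: int = 0         # a single program counter shared by every lane

    def step(self) -> None:
        """Apply the current instruction to all lanes at once (SIMD)."""
        if self.pc < len(self.program):
            op = self.program[self.pc]
            self.lanes = [op(v) for v in self.lanes]
            self.pc += 1


double_then_inc: Program = [lambda v: v * 2, lambda v: v + 1]
negate: Program = [lambda v: -v]

# Two batches in flight, running unrelated programs at unrelated points.
a = Batch(double_then_inc, [1.0, 2.0, 3.0, 4.0])
b = Batch(negate, [5.0, 6.0, 7.0, 8.0])
a.step()
a.step()
b.step()
```

Within `a` or `b` no lane can diverge from its neighbours, yet the two batches never have to agree on what they are executing.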
 
Dave Baumann said:
In 2D how many batches are being executed, do you think?
On G70 more than one at the same time, maybe just one batch on NV40 (AFAIK)
Maybe Bob can help us here..
 
Dave Baumann said:
Then, surely, that's a MIMD pipeline.
It is, if you like to stretch definitions. It's not entirely MIMD and it's not entirely SIMD.
EDIT: I'm sure we could even say a dual-core Pentium is a MIMD processor then..
 