Unified Shaders: With traditional pipes?

scificube

I just have some thoughts I'd like someone more qualified than me to take a look at.

We know ATI is electing to support unified shaders at the hardware level. What about Nvidia?

IIRC Nvidia recently stated they would support unified shaders, but I do not recall if they meant in the form of unified pipes. I do recall how in the past Nvidia seemed to argue that it was still better to have dedicated pipelines rather than unified pipes. So I decided to try and explore why Nvidia would think like this.

So far I've seen people note that pixel pipes as they are now can deal with latency a great deal better than vertex pipes. I assume this is because pixel pipes must work on textures so much, and that means going to memory, which means that by design they must be able to handle latency well. In contrast, the setup engine feeds vertex pipes directly and quickly, so latency issues are greatly reduced. Talk on these boards is where I derived this, and further talk about vertex texturing seems to highlight the difference. From what I understand, texture lookups in vertex processing are (or one could argue that they are) so costly because vertex pipes are not equipped to deal with the latency of doing this very well (relative to pixel pipes).
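A very rough way to see why pixel pipes tolerate memory latency better is to count how many threads a unit must keep in flight to cover a fetch. The little Python sketch below is only a toy model with made-up numbers (the 400-cycle latency and the fetch rates are my assumptions, not real figures):

Code:
# Toy model, not any real GPU: roughly how many threads a shader unit must
# keep in flight to hide memory latency (a Little's-law style estimate).
# All numbers are illustrative assumptions, not vendor specs.

def threads_needed(mem_latency_cycles, cycles_between_fetches):
    # While one thread waits on a texture fetch, the unit needs other runnable
    # threads to issue from; each thread starts a new fetch every
    # 'cycles_between_fetches' cycles, so we need roughly the ratio, rounded up.
    return -(-mem_latency_cycles // cycles_between_fetches)  # ceiling division

# Texture-heavy pixel shader: a fetch every 4 ALU cycles, 400-cycle memory latency.
print("pixel-style workload:", threads_needed(400, 4), "threads in flight")

# A vertex shader fed straight from the setup engine rarely touches memory,
# so a handful of threads is plenty...
print("vertex-style workload:", threads_needed(20, 20), "threads in flight")

# ...but the moment it does a texture lookup it faces the same 400 cycles,
# without the deep thread pool a pixel pipe carries around.
print("vertex texturing:", threads_needed(400, 20), "threads in flight")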

This is just to give Nvidia a basis for their argument. More qualified people may not see this as true, or may point to other traits of traditional pipes that would have had Nvidia take this position, at least in the past as they did.

...but what if they still hold that position now? Could Nvidia feasibly support unified shaders using traditional pipes, and what would be the benefits/drawbacks of doing this?

I have a rather simplistic line of reasoning as to how Nvidia could support unified shaders in this manner, and some equally simplistic ideas as to the pros/cons of doing so.

Unified with tradition:

1. Unified shaders are handled by a SW/HW scheduler at either the thread or instruction level.

2. There are still distinct pixel and vertex pipes.

3. The number of vertex pipes in any architecture supporting unified shaders is increased as follows:
a. in a 1:1 fashion with the number of pixel pipes, or
b. significantly, to meet:
b-1. what ATI, or rather an architecture that has unified its pipes, can do when NOT limited by the setup engine (reference to # of polys only)
b-2. the vertex load devs are moving toward pushing in their games
b-3. somewhere between b-1 and b-2, if there is a difference

4. The setup engine is beefed up so that it is not a bottleneck. (if need be...)

I really don't know how unified shaders work, or for that matter shaders in general, so I'm just going to describe what makes sense to me. A shader is code. This code can be executed in one or many threads on the same or different data sets pertaining to different pixels or triangles (I've tried, but I don't understand what a batch is yet).

It seems self-defeating to me that shaders would contain only pixel or vertex work, as unified shaders should be making it easier on programmers, so that both types of instructions can be interleaved into a single shader. What is important to the programmer is that it works and it's fast, not how it's done (for the sake of argument at the moment). With that said, when a shader is made known to the scheduler, the scheduler decides which pipe or pipes it should execute on. I would like to refine my thoughts a bit here. A shader is code. This code is executed in a task, where that task has one or more threads associated with it. The scheduler looks at the code in a shader and forks off threads to be executed on either a pixel pipe or a vertex pipe given the instruction mix it sees, taking into account dependencies etc. Threads execute until the task at hand is complete.
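To make the above a bit more concrete, here is a minimal sketch in Python of the kind of dispatch decision I'm imagining. The unit counts, the thread fields and the "look at the instruction mix" heuristic are all made up for illustration; no real scheduler works exactly like this:

Code:
# Hypothetical dispatcher for running "unified" shader code on split hardware.
# It inspects a thread's instruction mix and routes it to a vertex or pixel
# unit with free capacity. Purely illustrative.

from collections import deque

class SplitScheduler:
    def __init__(self, num_vertex_units, num_pixel_units):
        self.free_vertex = deque(range(num_vertex_units))
        self.free_pixel = deque(range(num_pixel_units))

    def dispatch(self, thread):
        # Crude heuristic: texture-heavy or pixel-stage work goes to a pixel
        # unit (better latency hiding); pure transform/ALU work goes to a
        # vertex unit. If the preferred pool is empty, fall back, else stall.
        prefers_pixel = thread["tex_ops"] > 0 or thread["stage"] == "pixel"
        if prefers_pixel and self.free_pixel:
            return ("pixel", self.free_pixel.popleft())
        if not prefers_pixel and self.free_vertex:
            return ("vertex", self.free_vertex.popleft())
        if self.free_pixel:
            return ("pixel", self.free_pixel.popleft())
        if self.free_vertex:
            return ("vertex", self.free_vertex.popleft())
        return ("stalled", None)

sched = SplitScheduler(num_vertex_units=8, num_pixel_units=16)
print(sched.dispatch({"stage": "vertex", "tex_ops": 0}))  # ('vertex', 0)
print(sched.dispatch({"stage": "pixel",  "tex_ops": 3}))  # ('pixel', 0)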

This is just my crude reasoning as to how it may be able to work. I don't claim to have a clue here...just bear with me. This was just to establish how unified shaders could be run on traditional pipes and be abstracted from the programmer's view.

The benefits to doing this would be:

1. Pipes would still be finely tuned to the specific tasks of pixel work and vertex work respectively, allowing for greater efficiency that can translate into better performance or savings elsewhere.

2. One can save the R&D cost of trying to make good unified pipes.

Drawbacks would be:

1. One effectively tosses load balancing and all its benefits out the window, one notable benefit being the ability to save on the number of transistors needed to get the job done.

2. The scheduler will need to be darn smart and thus will probably have a lot of complex logic, or, if this is done in the driver, the CPU had better be REALLY fast.

Offsets to the drawbacks:

1. If performance is the name of the game as it always is at some level...you should have the performance advantage at the cost of a higher transistor count. All the pixel and vertex pipes are at the ready to be fully utilized, as you are not re-tasking them to either pixel or vertex work and thus not robbing Peter to pay Paul. In fact, a clear performance advantage would be quite evident in cases where the load of BOTH vertex and pixel work were very high, so that an architecture with unified pipes simply had nowhere to shift the load and thus ease the pain.

2. In cases where cost efficiency rather than raw performance is the goal, one could aim for a lower theoretical performance than a unified-pipe architecture, but because you have dedicated pipes you can gamble that the aforementioned will happen: you will have more power at the ready to offset the performance advantage of the unified-pipe architecture when thinking about consistently heavy loads on both pixel and vertex processing.

3. One may get away with using fewer transistors, as your pipes may be simpler in not having to handle both vertex and pixel work.

Ideas Compacted:

Using unified pipes to handle unified shaders would allow for savings in the transistor budget due to load balancing, but in the case where both loads are too great, load balancing breaks down because there is nowhere left to shift work.

Using traditional pipes to handle unified shaders would cost more transistors but would allow for more power to be at the ready in the architecture. When the loads are too high the part would still stall, as a unified architecture would, but the tipping point could be much higher in that the architecture could handle greater pixel and vertex loads per unit time due to the power reserved for those ends.

Questions:

Should I put the pipe down? If so tell me which way is up before you walk away?

Aren't vertex pipes simpler than pixel pipes? If so, they should require fewer transistors? Is the difference "enough"?

Could this work at 90nm or is 65nm or below the only hope for this approach?

I think it's easier to have a pixel pipe handle vertex work than the other way around. When looking at Xenos, could this be a reason why vertex texturing is so good (so we've heard)...pixel pipes would be tuned to handle latency and if reworked to handle vertex this may explain things. Would regular vertex work suffer then if unified pipes were actually pixel pipes beneath the skin? (of course it would be hard to notice given load balancing, but relatively speaking could this be true)

Please discuss :)
 
scificube said:
Questions:

Should I put the pipe down? If so tell me which way is up before you walk away?
Not at all. The discussion about whether NVIDIA will unify in software only is very valid. The idea of a 'pipeline' though (which is what I first thought you meant :LOL: ) is getting tired. A shader 'pipe' is a SIMD or MIMD streaming processor, so for me 'shader ALUs' is good language for it, or 'shader processors'.

As for whether NVIDIA will hide the unification in software with 'split' hardware, yes, I think they will. Although I'll join that bit of the discussion later on.

scificube said:
Aren't vertex pipes simpler than pixel pipes? If so, they should require fewer transistors? Is the difference "enough"?
Yes, the vertex unit is simpler, but right now that's only because it has less ability than a fragment unit. Unify the capabilities ('caps') of vertex and fragment hardware and (pretty much) you're just building the same unit in silicon.

scificube said:
Could this work at 90nm or is 65nm or below the only hope for this approach?
You can do it at any process node. It all depends on how many units you want to build, how wide you go, and how complex the rest of your peripheral logic is. Obviously the smaller the process, the larger your transistor budget and the easier (mostly, with caveats) it gets.

scificube said:
I think it's easier to have a pixel pipe handle vertex work than the other way around. When looking at Xenos, could this be a reason why vertex texturing is so good (so we've heard)
Texture address and sampling is decoupled from the shader hardware in C1/Xenos. So the scheduler starts texturing work separately from the shader work, with the express intention of covering texturing latency so that results are ready when the shader unit needs them. That's the reason C1's vertex texturing will likely be very high performance.

scificube said:
...pixel pipes would be tuned to handle latency and if reworked to handle vertex this may explain things. Would regular vertex work suffer then if unified pipes were actually pixel pipes beneath the skin? (of course it would be hard to notice given load balancing, but relatively speaking could this be true)
The unified unit is just that, unified. Vertex caps are equal to fragment processing caps, so vertex work shouldn't suffer if the rest of the chip is smart. The only consideration should be when running older shader model code, where vertex processing is less able than fragment processing, but since one is a superset of the other, biased toward fragment processing, it shouldn't be an issue.

scificube said:
Please discuss :)
Awesome topic, this one should run and run.
 
scificube said:
If performance is the name of the game as it always is at some level...you should have the performance advantage at the cost of a higher transistor count.
This is the wrong way to think about it: in a thought experiment you have to compare the two methods at the same transistor count (or rather, die size).

You are completely reversing what unified shaders are. Unified shaders for the near future will be an implementation detail; it is invisible to the developer. The best case for a non-unified architecture is not "high" loads, whatever that might mean; it's perfectly balanced loads for its pixel/vertex shader division.
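To put rough numbers on that, here is a small Python sketch comparing a split design against a unified one of the same total width under different workload mixes. The 16+8 vs 24 unit counts and the equal per-unit throughput are illustrative assumptions, not real parts:

Code:
# Toy model: 24 shader units in both designs, equal per-unit throughput assumed.
# Split: 16 pixel-only + 8 vertex-only units. Unified: 24 that can do either.
# A frame's work is a mix: fraction p is pixel work, (1 - p) is vertex work.

def split_rate(p, pixel_units=16, vertex_units=8):
    # Sustained rate R must satisfy R*p <= pixel_units and R*(1-p) <= vertex_units,
    # so the slower of the two dedicated pools sets the pace.
    limits = []
    if p > 0:
        limits.append(pixel_units / p)
    if p < 1:
        limits.append(vertex_units / (1 - p))
    return min(limits)

def unified_rate(p, units=24):
    # Any unit can take any work, so only the total width limits the rate.
    return units

for p in (0.50, 2 / 3, 0.90):
    print(f"pixel share {p:.0%}: split {split_rate(p):5.1f}   unified {unified_rate(p):5.1f}")

# The split design only matches the unified one at a 2/3 pixel share, i.e. exactly
# its own 16:8 division; at any other mix some of its units sit idle.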
 
Good post Rys!

I need to come to grips with this caps idea. I was thinking the structural difference between vertex and pixel pipelines was more significant than what I think you are describing. I suppose if I distance myself from the pipeline concept and focus only on ALUs this makes a lot more sense.

I'm actually surprised I wasn't scolded immediately for this. Guess I'll keep smoking the stuff a little longer then :)

Edit:

misspelled...and no, I didn't feel scolded, MfA :)
 
MfA said:
This is the wrong way to think about it: in a thought experiment you have to compare the two methods at the same transistor count (or rather, die size).

You are completely reversing what unified shaders are. Unified shaders for the near future will be an implementation detail; it is invisible to the developer. The best case for a non-unified architecture is not "high" loads, whatever that might mean; it's perfectly balanced loads for its pixel/vertex shader division.

hmmm...I'll say first that I was just trying to find some benefits for going this route...I was looking for where things would be, well...unfair, or rather unequal. That's all.

I tried to describe an implementation that would be invisible to the developer, at least as far as coding unified shaders goes. I then tried to think from Nvidia's perspective and that is where I came to thoughts about unequal transistor budgets. What I meant by high loads is this: I was thinking of three distinct scenes. One dominated by pixel work, another dominated by vertex work, and a third scene where there was significant and heavy work to be done on both vertices and pixels. In the first two scenes unified shader processors :))) could swing resources to meet the task at hand. In the third scene there is no way to re-shape the battlefield, as the task at hand is demanding on both fronts. It was case three where I saw an advantage for more resources being available to use over resources that could be reallocated from doing other things. I'm only looking for advantages to try and understand Nvidia's past position and where they might go.

As far as the best case for a developer goes (you're a dev, right? DOH!...I meant no disrespect, of course, when I "said" the following), I would think they would shoot to meet what the HW could give them, as long as it is in accordance with what they want to do. If unified shaders work in both instances, then devs just look to how much the hardware will give them, no? In a console it may be unified shaders with traditional shading processors that devs prefer, so as to leverage as much power as possible. In a cell phone it may be unified shaders with unified pipes, because transistors can be saved, cutting down on precious power consumption (actually this is more a concern of the entity making the cell phone). In the PC space it's often code to the lowest common denominator, but then there are those who dare...so I've no idea.

I just want to say the idea doesn't escape me. I'm just trying to guess what Nvidia may do, so I tried to take as much as I could understand into account. Maybe that's a bad thing, but that's what I was doing.
 
scificube said:
coding unified shaders
That's your problem right there ... the instruction set is being unified, but developers will still not "code unified shaders". NVIDIA doesn't have to emulate a thing even if they stick to separate pixel&vertex shaders.
I then tried to think from Nvidia's perspective and that is where I came to thoughts about unequal transistor budgets.
That still doesn't make sense: why would NVIDIA have a higher transistor budget? A larger chip is a more expensive chip ... regardless of whether it uses a unified shader architecture or not.

The case for a non-unified architecture is simple: pixel and vertex shaders do different types of work ... if you specialize them you can make them smaller.

The case for a unified architecture is equally simple: the workloads are dynamic ... you can justify making larger general shader units, and thus have fewer in total at the same die size, because on average the increase in utilization will still provide a net benefit.
 
MfA said:
That's your problem right there ... the instruction set is being unified, but developers will still not "code unified shaders".
That still doesn't make sense: why would NVIDIA have a higher transistor budget? A larger chip is a more expensive chip ... regardless of whether it uses a unified shader architecture or not.

edit...gotta try again...

Pure lapse of judgement: I didn't think about coding unified shaders in terms of present and future at all...(I'm on a roll, folks...not.) I suppose you mean presently, but this will change after Vista arrives with DX10. So unified shaders don't exist then. Got it. With this being the case, the scheduler can be "dumber", for the moment at least. That's about the only major change needed there, I hope. I understand what you mean by "completely on the implementation side" as of right now. Sorry.

As for why Nvidia would want a higher transistor budget? To compete or dominate on the performance front...at least at the high end. The idea is to use more vertex shader processors to offset the significant peak vertex processing advantage unified shader processors would have. In modern GPUs there are already a lot of resources dedicated to pixel processing, so there is no need to compensate there. If the performance advantage translates into more chips being sold (or Nvidia sees how this could be leveraged to sell more chips) then Nvidia may consider this a win despite the extra cost.

I think I covered your last two points in my original post, though not as succinctly.
 
scificube said:
I did not know that the instruction set was not being unified...
Grrr ... I said the instruction set WAS being unified. Look, you have vertex and pixel shaders; the fact that they use the same instruction set doesn't force you to use a unified architecture or emulate one.
As for why Nvidia would want a higher transistor budget?
How is this relevant to the architecture? There isn't some magical limit on how large you can make a chip with a unified architecture; you can make it just as large as one with separate vertex and pixel shaders ... it is completely irrelevant.
 
MfA said:
Grrr ... I said the instruction set WAS being unified. Look, you have vertex and pixel shaders; the fact that they use the same instruction set doesn't force you to use a unified architecture or emulate one.
How is this relevant to the architecture? There isn't some magical limit on how large you can make a chip with a unified architecture; you can make it just as large as one with separate vertex and pixel shaders ... it is completely irrelevant.

I was correcting that while you posted :)
...you're not gonna hit me are you?

As for your second question: you are correct in that there is no defined limit, but it goes both ways. I was thinking in terms of expected performance. If Nvidia guessed wrong...they lose, just as in any other situation. I was thinking that ATI would leverage, or at least attempt to leverage, the ability to use less silicon and still have the better performer on the market. If I don't use reasonable constraints in both instances, then the chips get infinitely large and infinitely powerful, barring things that would never let that happen...like heat.

edit:
I'm not trying to make everything relative to the architecture in terms of simply enabling unified shaders. I'm trying to get inside Nvidia's head and go beyond that, to what they would do in trying to deal with ATI's offerings to the same end.
 
Yay!
Unified shading hardware is not done because it allows having unified shading languages between the vertex and pixel level. It doesn't. Unified languages can be had with a traditional architecture, too.

The real reason you want to do it is because it achieves pretty much perfect load balancing. That's not to say that I think it's easy to pull this off. But once you do, this is the benefit you're going to get. It's all about performance, not about features or the programming model.
 
zeckensack said:
Yay!
Unified shading hardware is not done because it allows having unified shading languages between the vertex and pixel level. It doesn't. Unified languages can be had with a traditional architecture, too.

The real reason you want to do it is because it achieves pretty much perfect load balancing. That's not to say that I think it's easy to pull this off. But once you do, this is the benefit you're going to get. It's all about performance, not about features or the programming model.

That makes sense. I'm still left asking why Nvidia would want to stick with a traditional architecture though? I could only look to the programming model and the efficiency of traditional pipes (or whatever they should be called now). If not for these reasons, I need help understanding Nvidia's position then. They surely had some sensible reason for taking the position they did.
 
NVIDIA have considerable experience and expertise in designing fast GPUs that split vertex and fragment processors at the design and silicon level. The reason they probably won't go unified in hardware for their first DX10 part is simply to leverage that existing experience and expertise one last time, before they make the switch (which I'm confident they will after a further generation of 'split' silicon). It gives them a time buffer, and "if it ain't broke, don't fix it" rings true for GPUs quite a bit.
 
scificube said:
If not for these reasons, I need help understanding Nvidia's position then. They surely had some sensible reason for taking the position they did.
What exactly are you basing "Nvidia's position" on? I don't think Nvidia has made their position clear, but I do think an interview with David Kirk revealed his opinion (and it can be extrapolated that this is also Nvidia's opinion) that it was not obvious that a unified shader architecture was optimal at this point in time, or the near future, as it were.
 
I said this previously in another thread, so here it goes again: I think that maybe NVIDIA has another solution to the unified shaders scenario, and part of that can be seen in the multiple clocks of the G70. For example: one VS running at 1GHz and 16 PS running at 700MHz. With this solution, they can maintain dedicated units for the vertex and pixel shaders and yet save some space in the units they use...
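As a quick back-of-the-envelope (the clocks are the hypothetical ones above; the unit counts and the "one work item per unit per clock" rate are illustrative assumptions only): peak rate scales roughly as units × clock, so a few faster dedicated vertex units can stand in for a wider array running at the pixel clock.

Code:
# Illustrative arithmetic only: peak rate ~ units * clock (work items per ns).

def peak_rate(units, clock_ghz, per_clock=1.0):
    return units * clock_ghz * per_clock

vs_fast  = peak_rate(units=4,  clock_ghz=1.0)  # few dedicated VS units at a high clock
vs_wide  = peak_rate(units=6,  clock_ghz=0.7)  # a wider VS array at the pixel clock
ps_array = peak_rate(units=16, clock_ghz=0.7)  # pixel units left at their own clock

print(f"{vs_fast:.1f} vs {vs_wide:.1f} vertex work/ns; pixels: {ps_array:.1f}")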
 
I don't think making specialized shaders will matter all that much for clockspeed; more logic for each shader unit, yes ... but nothing that couldn't be pipelined.
 
scificube said:
That makes sense. I'm still left asking why Nvidia would want to stick with a traditional architecture though?... They surely had some sensible reason for taking the position they did.

The reason is simple: the resources needed for such a step are rather huge; they'd have to start from scratch, more or less. They just didn't see the (still to be proven) benefits of a unified architecture as reason enough to invest that much and maybe still have a slower solution afterwards.

Kirk said in his interview that they always evaluate different architectural approaches before kicking off a new design, and he said the unified architecture didn't seem feasible at this point in time (kind of a "too early and not worth the effort yet" answer).
 
Sorry guys, I just had to get some sleep. I'm pretty sorry I missed Sigma's post, as I would've asked for some elaboration. I did not consider clocking pipes at different speeds. This would help the performance outlook, but then Nvidia has to do serious work to get the vertex pipes to run faster, perhaps multiple times faster, than their pixel pipes. I see problems with this in the long run. I could be wrong though, given I have not seen all of Sigma's thoughts on this. Sounds interesting though.

Anyway, thanks for the replies thus far. It certainly seems possible I took Kirk's words the wrong way, in that Nvidia still likes discrete pipes only "for the moment" and not as an alternative to unified shader processors in the future. That certainly would make a lot of sense.
 
The thing that really strikes me, in an ironic kind of way, is that if one really wanted a high-performing part that is non-unified at the silicon level (in other words, what NV is pointing at), the best way to accomplish that would be with a really robust scheduler in the hardware, to drive the maximizing of unit use and instruction routing down at the hardware level and simplify your driver overhead (which, received wisdom says, is always slower) for getting there.

And yet who do we see with the really robust scheduler in hardware so far? :LOL:

Edit: Come to think of it, you need the scheduler either way, in my opinion, and probably the unified-in-silicon one is more robust, as it is one level down in finding/using units. But, still, a robust hardware scheduler seems like a Very Good Idea for NV's path as well.
 
Yeah, I'd say ATI has its ducks in a row going with the ultra-threaded concept. It just makes too much sense...it has to if I can understand it :)
 
scificube said:
I'm pretty sorry I missed Sigma's post, as I would've asked for some elaboration. I did not consider clocking pipes at different speeds. This would help the performance outlook, but then Nvidia has to do serious work to get the vertex pipes to run faster, perhaps multiple times faster, than their pixel pipes. I see problems with this in the long run. I could be wrong though, given I have not seen all of Sigma's thoughts on this. Sounds interesting though.

Well, I really can't elaborate more on it because it isn't my area. :) It was just something that came to mind. How NVIDIA (or anyone else for that matter) could accomplish this in hardware, or whether it is a good idea, is beyond me.
As a software developer, I really don't see a huge advantage in unified shaders. HLSL/GLSL already help with that, and with vertex texturing the gap is very small. At the hardware level, that is another matter, and I understand the benefits of a unified architecture, but I think a solution like the one I described previously could also work.
We cannot forget about the upcoming topology/geometry processor, and the vertex shaders are really tied to it. I can't really be sure, but I think I saw in a presentation that there will be something like two stages for the vertex shader, one after the primitive assembler and one after the topology shader...
 