I just have some thoughts I'd like someone more qualified than me to take a look at.
We know ATI is electing to support unified shaders at the hardware level. What about Nvidia?
IIRC Nvidia recently stated they would support unified shaders, but I don't recall if they meant in the form of unified pipes. I do recall that in the past Nvidia seemed to argue it was still better to have dedicated pipelines rather than unified pipes. So I decided to try to explore why Nvidia might think this way.
So far I've seen people note that pixel pipes, as they exist today, can deal with latency a great deal better than vertex pipes. I assume this is because pixel pipes must work on textures so much, and that means going to memory, which means that by design they must be able to hide latency well. In contrast, the setup engine feeds vertex pipes directly and quickly, so latency issues are greatly reduced. I derived this from talk on these boards, and further talk about vertex texturing seems to highlight the difference. From what I understand, texture lookups in vertex processing are (or one could argue they are) so costly because vertex pipes are not equipped to deal with that latency very well (relative to pixel pipes).
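To make the latency argument above concrete, here's a back-of-the-envelope model (entirely my own illustration, not any vendor's actual design; all the numbers are invented): a pipe hides memory latency only if it keeps enough threads in flight to stay busy while a texture fetch is outstanding.

```python
def utilization(threads_in_flight, alu_cycles_per_fetch, fetch_latency):
    """Fraction of cycles the ALU stays busy, assuming each thread does
    `alu_cycles_per_fetch` cycles of math, then waits `fetch_latency`
    cycles on memory while other threads run in its place."""
    available_work = threads_in_flight * alu_cycles_per_fetch
    return min(1.0, available_work / (alu_cycles_per_fetch + fetch_latency))

# A "pixel-pipe-like" design with many threads hides a 200-cycle fetch:
print(utilization(64, 4, 200))   # 1.0 (fully busy)

# A "vertex-pipe-like" design with few threads mostly stalls:
print(utilization(4, 4, 200))    # ~0.08 (idle most of the time)
```

Under this toy model, the pixel pipe's latency tolerance comes purely from thread count, which is consistent with why a vertex texture fetch would hurt so much on a pipe not built to juggle many threads.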
This is just to give Nvidia a basis for their argument; more qualified people may not see this as true, or may note other traits of traditional pipes that would have led Nvidia to take this position, at least in the past.
...but what if they still hold that position now? Could Nvidia feasibly support unified shaders using traditional pipes, and what would be the benefits/drawbacks of doing this?
I have a rather simplistic reasoning as to how Nvidia could support unified shaders in this manner and some equally simplistic ideas as to the pros/cons of doing this.
Unified with tradition:
1. Unified shaders are handled by SW/HW scheduler at either the thread or instruction level.
2. There are still distinct pixel and vertex pipes.
3. The number of vertex pipes in any architecture supporting unified shaders is increased as follows:
a. in a 1:1 fashion with the number of pixel pipes, or
b. significantly, to meet:
b-1 what ATI, or rather an architecture that has unified its pipes, can do when NOT limited by the setup engine (in reference to # of polys only)
b-2 the vertex load devs are moving toward pushing in their games
b-3 somewhere between b-1 and b-2, if there is a difference
4. The setup engine is beefed up so that it is not a bottleneck. (if need be...)
I really don't know how unified shaders work, or shaders overall for that matter, so I'm just going to describe what makes sense to me. A shader is code. This code can be executed in one or many threads, on the same or different data sets, pertaining to different pixels or triangles (I've tried, but I don't understand what a batch is yet).
It seems self-defeating to me that shaders would contain only pixel or only vertex work, as unified shaders should make things easier on programmers by letting both types of instructions be interleaved into a single shader. What is important to the programmer is that it works and it's fast, not how it's done (for the sake of argument at the moment). With that said, when a shader is made known to the scheduler, the scheduler decides which pipe or pipes it should execute on. I would like to refine my thoughts a bit here. A shader is code. This code is executed as a task, where that task has 1 or more threads associated with it. The scheduler looks at the code in a shader and forks off threads to be executed on either a pixel pipe or a vertex pipe, given the instruction mix it sees and taking dependencies etc. into account. Threads execute until the task at hand is complete.
This is just my crude reasoning as to how it may be able to work. I don't claim to have a clue here...just bear with me. This was just to establish how unified shaders could be run on traditional pipes and be abstracted from the programmer's view.
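To make the dispatch idea above a little more concrete, here's a toy sketch in Python (entirely my own illustration of the scheme I described, not how any real driver or chip works; the instruction names and the classification rule are made up):

```python
# Hypothetical instruction categories (invented for illustration).
VERTEX_OPS = {"transform", "skin", "tessellate"}
PIXEL_OPS = {"tex_lookup", "blend", "fog"}

def dispatch(shader_instructions):
    """Split a 'unified' shader's instructions into work targeted at
    dedicated pipes, based on the instruction mix the scheduler sees."""
    threads = {"vertex_pipe": [], "pixel_pipe": []}
    for instr in shader_instructions:
        if instr in VERTEX_OPS:
            threads["vertex_pipe"].append(instr)
        else:
            # Default to the pixel pipe: it tolerates fetch latency better.
            threads["pixel_pipe"].append(instr)
    return threads

# One interleaved shader gets forked across both kinds of pipe:
print(dispatch(["transform", "tex_lookup", "blend", "skin"]))
```

A real scheduler would of course have to honor data dependencies between the split threads and move results between pipes, which is exactly where the "darn smart" complexity I mention below would live.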
The benefits to doing this would be:
1. Pipes would still be finely tuned to the specific tasks of pixel and vertex work respectively, allowing for greater efficiency that can translate into better performance or savings elsewhere.
2. One can save the R&D cost of trying to make good unified pipes.
drawbacks would be:
1. One effectively tosses load balancing and all its benefits out the window, one notable benefit being the ability to save on the # of transistors needed to get the job done.
2. The scheduler will need to be darn smart and thus will probably have a lot of complex logic; or, if this lives in the driver, the CPU had better be REALLY fast.
offsets to the drawbacks:
1. If performance is the name of the game, as it always is at some level, you should have the performance advantage at the cost of a higher transistor count. All the pixel and vertex pipes are at the ready to be fully utilized, as you are not re-tasking them to either pixel or vertex work and thus not robbing Peter to pay Paul. In fact, a clear performance advantage would be quite evident in cases where the load of BOTH vertex and pixel work were very high, so that an architecture with unified pipes simply had nowhere to shift the load and thus ease the pain.
2. In cases where performance is not the name of the game but cost efficiency is, one could aim for lower theoretical performance than a unified-pipe architecture and, because you have dedicated pipes, gamble that the aforementioned will happen: you will have more power at the ready to offset the performance advantage of a unified-pipe architecture under consistently heavy pixel and vertex loads.
3. One may get away with using fewer transistors, as your pipes may be simpler for not having to handle both vertex and pixel work.
Ideas Compacted:
Using unified pipes to handle unified shaders would allow for savings in the transistor budget due to load balancing, but when the loads are too great, load balancing breaks down.
Using traditional pipes to handle unified shaders would cost more transistors but would allow for more power to be at the ready in the architecture. When the loads are too high, the part would still stall just as a unified architecture would, but the tipping point could be much higher, in that the architecture could handle greater pixel and vertex loads per unit time due to the power reserved for those ends.
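The trade-off I'm describing can be sketched with some crude arithmetic (all pipe counts and loads invented purely for illustration; "throughput" here is just work items served per unit time):

```python
def unified_throughput(total_pipes, pixel_load, vertex_load):
    """Unified design: any pipe can serve either kind of work,
    so only the combined capacity matters."""
    return min(pixel_load + vertex_load, total_pipes)

def dedicated_throughput(pixel_pipes, vertex_pipes, pixel_load, vertex_load):
    """Dedicated design: each pool can only serve its own kind of work."""
    return min(pixel_load, pixel_pipes) + min(vertex_load, vertex_pipes)

# A 16-pipe unified part can match a bigger 12+8 dedicated part
# whenever the combined load fits:
print(unified_throughput(16, 10, 5))        # 15
print(dedicated_throughput(12, 8, 10, 5))   # 15

# ...but when BOTH loads run hot, the unified part saturates first,
# while the dedicated part's tipping point is higher:
print(unified_throughput(16, 14, 9))        # 16 (saturated)
print(dedicated_throughput(12, 8, 14, 9))   # 20
```

The flip side shows up when the load is lopsided (say, 15 pixel / 1 vertex): the unified part serves all 16, while the dedicated part strands idle vertex pipes, which is the load-balancing benefit I said gets tossed out.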
Questions:
Should I put the pipe down? If so, tell me which way is up before you walk away.
Aren't vertex pipes simpler than pixel pipes? If so, shouldn't they require fewer transistors? Is the difference "enough"?
Could this work at 90nm or is 65nm or below the only hope for this approach?
I think it's easier to have a pixel pipe handle vertex work than the other way around. Looking at Xenos, could this be a reason why vertex texturing is so good (so we've heard)? Pixel pipes would be tuned to handle latency, and if reworked to handle vertex work, that may explain things. Would regular vertex work suffer, then, if unified pipes were actually pixel pipes beneath the skin? (Of course it would be hard to notice given load balancing, but relatively speaking, could this be true?)
Please discuss