|
|
#1 | |
|
Member
|
First, a comment from one of the guys on 3DGPU forums:
Quote:
http://graphics.stanford.edu/papers/rtongfx/
In previous threads, some people (not all) noted that flow control is not a big deal... This guy makes it sound like it is! Any ideas?
Also, a couple of things were suggested as to what NV30 is likely to have:
1. "The one advantage that NV30 will have over R300 is the programmable triangle tessellation unit. It's a fantastic feature but unfortunately one that will not be taken advantage of by game developers in its product lifecycle. (Heck, games are just now starting to take advantage of shaders which have been around since NV20!) Still, it's nice to see this type of innovation because it paves the way for making this a standard feature in future designs"
2. "A virtually unlimited number of pixel shader instructions"
The reason I'm bringing it all up is that I'm interested in knowing how NV30 will stand up against R300. |
|
|
|
|
|
|
#2 |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
One thing that I can see is that if you don't have flow control in hardware, you should still be able to achieve the same effect by using another pass for each branch (i.e. each "if" statement would be a new pass). Obviously, doing the branch in hardware instead might allow for a very significant performance improvement.
However, things are not all rosy when it comes to flow control. For one, any sort of branching can cause major stalls in any pipeline. One of the reasons 3D accelerators are so fast is that they are very predictable, allowing for very efficient caching. Flow control significantly reduces the predictability of 3D renderers, making smart caching much more challenging. While it may be true that flow control is better than multiple passes, it is not better than doing everything in one pass. One thing that does upset me about what some people have said about the R300 is that it is designed to do everything in one pass. Since it does have limited program sizes and no flow control in the pixel shader, it is certain that there are shaders that would take more than a single pass. Hopefully we'll have other hardware out soon that does not have these limitations. |
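The "one pass per branch" idea from this post can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `shade_a`, `shade_b` and `condition` are made-up stand-ins for the two sides of an "if" and its test, and the mask plays the role of a stencil or alpha-test result, not any particular chip's mechanism.

```python
# Hypothetical stand-ins for the two shading paths and the per-pixel test.
def shade_a(p):
    return p * 2.0

def shade_b(p):
    return p + 1.0

def condition(p):
    return p > 0.5

def single_pass_with_branch(pixels):
    """One pass: each pixel executes exactly one side of the 'if'."""
    return [shade_a(p) if condition(p) else shade_b(p) for p in pixels]

def multipass_without_branch(pixels):
    """No per-pixel branch in the shader: one pass per side, gated by a mask."""
    # Pass 0: evaluate the condition everywhere and store it as a mask.
    mask = [condition(p) for p in pixels]
    out = [None] * len(pixels)
    # Pass 1: run shade_a over the geometry; the mask test (done by
    # fixed-function hardware in this model) keeps only the masked pixels.
    for i, p in enumerate(pixels):
        if mask[i]:
            out[i] = shade_a(p)
    # Pass 2: run shade_b over the same geometry with the mask inverted.
    for i, p in enumerate(pixels):
        if not mask[i]:
            out[i] = shade_b(p)
    return out
```

Both functions give the same image; the multipass version pays for it by walking the pixel list three times instead of once, which is the cost being weighed against hardware flow control here.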
|
|
|
|
|
#3 |
|
Member
|
Thx for the info m8!
One thing I can safely say is that flow control will definitely be in NV30. I know that for certain; don't ask me from whom or how, 'cause I won't tell |
|
|
|
|
|
#4 |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
I know that vertex shader flow control will be in the NV30 (from the Cg specs). But will it be in the pixel shader?
|
|
|
|
|
|
#5 | |
|
Member
|
Quote:
As I said, my source ROCKS!!!! |
|
|
|
|
|
|
#6 |
|
Regular
Join Date: Feb 2002
Location: California
Posts: 4,732
|
That paper is talking about raytracing. PS flow control only benefits that by reducing multipass. Lots of other features are discussed in that paper like PS writing stencil bits directly, early fragment kill, etc.
But flow control won't make any difference for most of the shaders being used in games, because no one is going to waste their time trying to do raytracing in a game. |
|
|
|
|
|
#7 |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
Besides just reducing the number of passes, flow control reduces the need for automatic multipass rendering in HLSLs, which I think we've already decided can have a serious impact on performance.
|
|
|
|
|
|
#8 |
|
Junior Member
Join Date: Jul 2002
Posts: 91
|
Quote:
"However, things are not all rosy when it comes to flow control. For one, any sort of branching can cause major stalls in any pipeline."
That's not true. The reason you normally have a major stall in a CPU pipeline is a mispredicted branch, and only when you are speculatively executing instructions down the predicted path and are forced to flush them out of the pipeline and start fetching from the actual calculated path. The longer the pipeline, the longer the stall. This is why the BPU on a 4-stage CPU like the MPC7400 can be much simpler than that of, let's say, the P4: the hit is negligible even when it happens. On these modern-day GPUs it's even simpler because we're talking about only one operation being in the pipe at any given time. If there's a mispredict you might stall for a cycle at worst as you throw away your speculative fetch. You have to be careful in generalizing things you know about CPUs and applying them to GPUs, because at the moment they are still significantly different in many ways.
As far as the triangle tessellation unit (or the Programmable Primitive Processor, PPP, as some people call it), there is no question we will see this on the NV30. Again, this is a fantastic idea and one that will revolutionize how we deal with things like continuous level of detail on terrain meshes etc. The problem is that many of the games we're going to see in the next 3 years are going to be based on either the UT2k3 or the Doom3 engines and neither of them will take advantage of this feature. As I stated earlier, it'll help advance 3D in general and hopefully ATI also adds this to R400, but IMO it's not a feature that'll actually be utilized in this generation. |
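The "longer pipeline, longer stall" argument can be put into a back-of-the-envelope formula. The sketch below uses the usual textbook model, in which every mispredicted branch costs a fixed number of flush cycles; the function name and the sample numbers are illustrative assumptions, not measurements of any CPU or GPU mentioned in the thread.

```python
def effective_cpi(branch_fraction, mispredict_rate, flush_penalty, base_cpi=1.0):
    """Average cycles per instruction when each mispredicted branch
    costs flush_penalty extra cycles.

    branch_fraction: share of instructions that are branches (0..1)
    mispredict_rate: share of those branches that are mispredicted (0..1)
    flush_penalty:   cycles lost per mispredict, roughly the number of
                     pipeline stages between fetch and branch resolution
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty

# Same hypothetical workload on a short and on a long pipeline.
short_pipe = effective_cpi(0.2, 0.1, flush_penalty=1)
long_pipe = effective_cpi(0.2, 0.1, flush_penalty=20)
```

With a one-cycle penalty the branch cost almost vanishes, which is the poster's point about short pipelines; how many stages really sit between fetch and branch resolution on a GPU is exactly what the rest of the thread argues about.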
|
|
|
|
|
#9 | |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
Quote:
|
|
|
|
|
|
|
#10 |
|
Junior Member
Join Date: Jul 2002
Posts: 91
|
Quote:
"GPU's must attempt to put texture data in the caches before rendering. This is where most of the stalls would occur."
OK, so you're talking about the pixel pipeline specifically then? I thought you were referring to the vertex pipeline, since that's the one with flow control on the R300. Even in the pixel pipeline case, I'm not sure if this is entirely true. On traditional pixel shader programs the HW is constantly prefetching all of the textures that you reference. How intelligent will next-gen HW be? Will it look at the scope of the loop and only prefetch what is necessary? If so, then yes, it will take a hit on a mispredict, but we can safely say that for the most part we will see very high branch prediction rates regardless of whether it's dynamic or not.
Quote:
"Additionally, there may be problems with keeping all of the pipelines flush if the GPU has a hard time figuring out how many clock cycles it will take each pixel pipeline to finish a pixel."
See above. The HW has to enforce sync across all pipelines, so perhaps they will always prefetch all textures referenced in the code. This gets rid of sync issues due to mispredicts, as well as the added HW complexity of analyzing the scope of the loop or speculatively evaluating both sides of the branch. |
|
|
|
|
|
#11 | ||
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
Quote:
As for filling texture cache, branching makes it so that there are more possible textures, and reading the wrong ones into cache will eat up performance. Yes, you could read all possible textures into the caches, but if some of those aren't used, you aren't making full use of the available performance. This approach also requires quite a bit more cache (particularly if you're talking about 128-bit source art, such as what you might want to use for bump maps). If you attempt to predict which textures are to be used, you are bound to cause stalls.
I'm sure that vertex shaders will also run into problems related to branching, but I'm not completely certain what those might be. I suppose it would depend on the program. If it were possible in the vertex shader for the calculation of one vertex to be dependent upon the outcome of the previous vertex, then branching could be hell on parallelism.
Quote:
|
||
|
|
|
|
|
#12 |
|
Regular
Join Date: Feb 2002
Location: California
Posts: 4,732
|
GPUs have incredibly deep pipelines in comparison to CPUs and no GPUs have branch prediction at the moment. A pipeline stall on a GPU would kill performance. Of course, the solution is to start to put branch prediction logic into the GPU. Again, this significantly complicates the GPU for no compelling reason (most of the pixel shading algorithms don't need this), and we know from history that branch prediction only takes you so far. Intel's Itanium is an attempt to get past the plateau that branch prediction and speculative execution techniques have hit on superscalar designs.
(Right now, pixel shaders are little more than multitexture register combiner units, and you guys are talking about adding speculative execution and branch prediction! Get REAL!) In essence, we'd come full circle by burdening GPUs with the performance-killing paradigms that CPUs have had to grapple with for years. Meanwhile, IA-64 is attempting to shed this speculative execution baggage by moving the logic into the compiler, rather than requiring the CPU to do it at run time as on IA-32. When someone comes up with an overwhelmingly compelling pixel shader that needs this stuff, you might convince me, but I've been doing RenderMan for years, and I haven't encountered anything that truly needed it. ATI's RenderMan conversions to real-time demos bear witness to this. Keep the pixel pipelines lean and mean. |
|
|
|
|
|
#13 | |
|
Senior Member
|
Quote:
I'm sure it doesn't have anything to do with switching speed or interconnect parasitics, in terms of limiting clock rate. Power might be a concern, since dynamic power rises linearly with frequency and quadratically with the supply voltage that usually has to rise along with it. I dunno, it just doesn't make much sense why they're not going faster instead of largely going wider. Mind you, with graphics there is LOTS of room for performance gain in wide-issue MPUs, but a faster clock allows you to hide some memory latency and can help in increasing effective bandwidth. |
|
|
|
|
|
|
#14 |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
As far as I'm concerned, there's no compelling reason currently to optimize for branches in the pixel pipeline, but it is pretty certain that there will be situations where you'll want to use those branches.
I think that, in the future, we will have GPUs with instruction sets similar to the IA64 that would put the branch prediction type stuff in the software. Anyway, my belief is that sometime in the future, we will have GPUs capable of flow control in the pixel shader, and we will also have pixel shaders written that will work best if the flow control is done in hardware. The sooner we have some base functionality for flow control, the sooner we will be able to see these shaders (The same goes for unlimited-length programs...at least, reasonably unlimited-length...). |
|
|
|
|
|
#15 |
|
Junior Member
Join Date: Jul 2002
Posts: 91
|
Quote:
"Yes, you could read all possible textures into the caches, but if some of those aren't used, you aren't making full use of the available performance."
Agreed, but the HW complexity of determining what to selectively prefetch might be too high. I suppose the compiler could calculate this ahead of time and encode this information in the ASIC-specific micro-op, but then we get into all of the other associated problems.
Quote:
"I'm sure that vertex shaders will also run into problems related to branching, but I'm not completely certain what those might be. I suppose it would depend on the program."
I don't see this as an issue, since vertex data is contiguous. For argument's sake, assume each cache line is 8 DWORDs long. This means that as long as 1 in 8 fetches from a particular buffer is referenced in a branch, it was worthwhile to perform the fetch.
Quote:
"GPUs have incredibly deep pipelines in comparison to CPUs and no GPUs have branch prediction at the moment."
DemoCoder, I don't think that's true. Why do you think the pipelines are incredibly deep? I would think the exact opposite is true, as there is no reason for it to be deep. Aside from clock frequency, you don't need to have a lot of operations in flight, as there is no out-of-order execution, and every operation is single-cycle in execution, not just in throughput, i.e. a dependent instruction can execute on the very next cycle. This can be accomplished through data feed forwarding as well, but I really don't think GPUs are anywhere near this level of complexity today. The other thing to consider is your point about no GPUs having branch prediction today. The P10 and R300 both offer branching, yet no branch prediction unit. If the pipeline were indeed that deep, wouldn't this be a necessity? |
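The "1 in 8" remark about cache lines can be checked with simple counting. The sketch below assumes the 8-DWORD line from the post and a buffer that is prefetched in full; `line_fetch_stats` is a made-up helper name, and the index lists in the usage note are arbitrary examples.

```python
LINE_DWORDS = 8  # cache line length assumed in the post

def line_fetch_stats(total_dwords, used_indices, line_dwords=LINE_DWORDS):
    """Return (lines_fetched, lines_with_a_hit) for a contiguous buffer
    that is prefetched in full.

    used_indices: DWORD offsets (each below total_dwords) that the
    shader actually reads after its branches are resolved.
    """
    lines_fetched = -(-total_dwords // line_dwords)  # ceiling division
    lines_hit = {i // line_dwords for i in used_indices}
    return lines_fetched, len(lines_hit)
```

For a 32-DWORD buffer, reading offsets 0, 8, 16 and 24 touches all four lines, so no fetch was wasted even though only one DWORD in eight was used; reading only offsets 0 to 3 leaves three of the four fetched lines unused.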
|
|
|
|
|
#16 |
|
Regular
Join Date: Feb 2002
Location: California
Posts: 4,732
|
Sigh,
You keep talking about vertex shaders, but it is clear from the context of these discussions from the beginning that we have been talking about the pixel pipeline and branching in that. We already know that DX9 contains branching in VS. The issue du jour is the performance implications of putting these operations in the pixel pipeline as well.
Of course 3D pipelines are deep. Do you really think that you can go from the triangle setup stage to having a pixel written to the framebuffer in a single clock cycle? In order to achieve pixel-per-clock speed, the 3D pipeline has to hide all kinds of latencies: texture fetches, cache operations, floating point ops that might take more than 1 cycle, AGP accesses, etc. And if you want to discuss the full pipeline, as measured from the vertex unit all the way down to the framebuffer, it is way deep compared to a CPU pipeline. For reference, the GeForce3 has about 600-800 pipeline stages (if you count all 4 pipelines). http://www.ce.chalmers.se/undergradu...2002/gfxhw.pdf (page 8)
So we have incredibly deep pipelines, whose performance depends heavily on parallelization and predictable memory access patterns, and now you want to introduce branching into this pipeline, which will disrupt memory access patterns, and you think it will be as simple as slapping a branch prediction unit into it? |
|
|
|
|
|
#17 |
|
Junior Member
Join Date: May 2002
Posts: 18
|
I agree with DemoCoder: pixel pipes are very deep and long, which is why optimisation guides still tell you to sort by state and avoid needless toggling between 2 pipeline states (each change might need you to flush the data out of the pipe).
Now we need to specify what we are talking about: constant branching or dynamic branching. Let me give an example of the difference.
Constant branching in the vertex shader would allow you to enable a certain number of lights (different loop execution count) and enable environment mapping or a default mapping depending on the state of a constant bool. These branches are static because you set them per vertex batch. You set the bool/int constants and then you execute them on a batch of vertices.
Dynamic branching, what you guys seem to be talking about, is different in that the condition for the branch is determined by the vertex program itself. For example, based on a distance calculated in the vertex shader (and thus possibly a very different result depending on which vertex you process), you execute a different section of a program (say, a simple light model far away and a detailed light model close by).
Anyway, static branching is fairly easy and does not give a lot of pipeline stalls or problems. The second, however, can be a pain because you have a SIMD structure where suddenly the 4 pipes are no longer guaranteed to follow the exact same path... The same is true for pixel pipes: they work on 4 or 8 pixels at the same time, and because of the structure of the pipes you want to execute the exact same operations on all 4 pipes... Now with completely dynamic conditional branching you get the situation where all 4/8 pipes might have to execute a different branch... SIMD architectures love that... NOT!
CPUs do not use SIMD instructions when they know that they do not have elements that are of a SIMD nature. So all your CPU talk is of little use when you are stuck in a high-performance SIMD structure which is always SIMD and nothing but SIMD. If we see this kind of branching in the vertex and/or pixel shader it will potentially be quite slow, so developers will have to take great care about how they use it...
Remember, GPUs are highly streamlined beasts built to do graphics; let's not make them "so" flexible that we end up with something as slow as a CPU doing graphics... If this continues, throw out your Intel/AMD and put in an ATI/NVIDIA/... (C)GPU. G~ |
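The SIMD problem described in this post can be modelled with a toy cost function. It assumes a group of lanes (4 pixels or vertices) that must all receive the same instruction, with lanes masked off when they did not choose the side being executed; the function name and cycle counts are illustrative assumptions, not a description of any shipping chip.

```python
def lockstep_cost(takes_branch, cost_taken, cost_not_taken):
    """Cycles for one SIMD group that issues the same instruction to every lane.

    takes_branch: one boolean per lane (per pixel or vertex).
    If the lanes disagree, the group steps through BOTH sides of the
    branch and masks off the lanes that did not choose each side.
    """
    cycles = 0
    if any(takes_branch):          # at least one lane needs the "taken" side
        cycles += cost_taken
    if not all(takes_branch):      # at least one lane needs the other side
        cycles += cost_not_taken
    return cycles

# Hypothetical costs: detailed light model = 10 cycles, simple model = 3.
coherent_near = lockstep_cost([True, True, True, True], 10, 3)
coherent_far = lockstep_cost([False, False, False, False], 10, 3)
divergent = lockstep_cost([True, False, True, False], 10, 3)
```

A static branch driven by a constant bool always lands in one of the two coherent cases, because every lane sees the same constant. A dynamic branch can land in the divergent case, where the group pays for both light models at once.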
|
|
|
|
|
#18 | |
|
Member
|
A couple of things according to this guy:
Quote:
|
|
|
|
|
|
|
#19 |
|
Junior Member
Join Date: Jul 2002
Posts: 91
|
Quote:
"We already know that DX9 contains branching in VS. The issue du jour is the performance implications of putting these operations in the pixel pipeline as well."
Yes, we know that, but what are its performance implications? Do you already have access to DX9 HW? Many of the issues related to VS branching also apply to PS branching.
Quote:
"Of course 3D pipelines are deep. Do you really think that you can go from the triangle setup stage to having a pixel written to the framebuffer in a single clock cycle? In order to achieve pixel-per-clock speed, the 3D pipeline has to hide all kinds of latencies, texture fetches, cache operations, floating point ops that might take more than 1 cycle, AGP accesses, etc."
I think it's patently obvious to even the most incredibly obtuse that the ENTIRE pipeline is not a single stage. This was never in question. What we're talking about here SPECIFICALLY is the execution portion of the pixel shader pipeline. If you mispredict on a pixel shader instruction you don't throw away everything starting from your triangle setup! Why do you think that paper references a 20-stage P4 pipeline? Because the other 8 stages outside of the trace cache are used for x86 decode and are not relevant to the discussion. Are we all on the same page now? We're talking about the RELEVANT portion of the pipeline, not the entire thing from start to finish.
On modern GPUs, each shader operation has single-cycle throughput. The question is whether it also has single-cycle execution. If the execution unit were multi-stage you could potentially have single-cycle throughput but not single-cycle execution. Floating point pipelines on most modern CPUs are multi-stage. This is why, unless you have out-of-order execution or the compiler schedules it accordingly, you are going to run into stalls on dependent operations. The pixel shader, on the other hand, allows you to use the results from the previous operation on the very next cycle. This implies that the execution unit is 1 stage deep.
As Chalnoth and Guest have pointed out, there are two main classes of issues to deal with:
1) Possibly unnecessary texture/data fetches due to conditional branches
2) Pipeline synchronization due to each one potentially evaluating branches differently and taking a different number of cycles to complete.
Both of these issues are present (to different degrees) in both vertex and pixel shaders. How does modern DX9 HW address these in the vertex pipeline? The answer to this will give you a clue as to how they plan on supporting this in the pixel pipeline. I'm positive that this is the direction we're moving in, and if we don't see it on NV30, you'll definitely see it in the generation after that. |
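The distinction drawn here between single-cycle throughput and single-cycle execution can be made concrete with a small counting model. It assumes an in-order unit that accepts one new operation per cycle, has no out-of-order execution and no independent work available to fill bubbles; `cycles_for_chain` is a made-up helper, not a model of any specific GPU or CPU.

```python
def cycles_for_chain(n_ops, latency, dependent):
    """Cycles to finish n_ops on a unit that can start one op per cycle.

    latency:   stages an op spends in the unit before its result is usable
    dependent: True if every op needs the result of the op before it
    """
    if n_ops == 0:
        return 0
    if dependent:
        # Each op must wait for the full latency of its predecessor.
        return n_ops * latency
    # Independent ops overlap: fill the pipe once, then one result per cycle.
    return latency + (n_ops - 1)
```

With a latency of 1 the dependent and independent cases cost the same, which matches the claim that a result is usable on the very next cycle. With a 4-stage unit, 8 dependent ops take 32 cycles against 11 for independent ones, which is the kind of stall a multi-stage floating point pipeline sees without careful scheduling.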
|
|
|
|
|
#20 | |
|
Senior Member
Join Date: Jan 2002
Location: Abbots Langley
Posts: 732
|
Quote:
__________________
Kristof |
|
|
|
|
|
|
#21 |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
I still think it would be better to have branching capability than fail to do it because of performance issues.
Future hardware architectures can deal with figuring out how to optimize performance with branching. After all, since 3D is still much more predictable than CPU code, it should still be easier to do the proper branch prediction. What may prove impossible, however, is keeping all pixel pipelines full as often as they are today.
But, game developers need not use branches. Just because they exist in hardware doesn't mean that they need to reduce performance. Any smart developers will only use them in limited situations, and the hardware designers may release white papers on "how to optimally use flow control" in order to keep the pipelines moving. Some examples might be:
1. Avoid sending pixels whose branches flow in a chaotic order (i.e. lots of pixels that follow the same branch should be sent at once).
2. Keep from having an excessive amount of branching.
Anyway, to me, branching is just a way to avoid the need for auto-multipass, which we know could slow down the pipelines even more (except, perhaps in scene graphs). |
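The first guideline in this post (send pixels that follow the same branch together) can be illustrated by counting how many groups of pixels disagree about a branch. The sketch assumes pixels are processed in groups of 4 in submission order; real hardware groups pixels by screen position, so sorting a flat list is only a stand-in for submitting geometry so that neighbouring pixels agree.

```python
def divergent_groups(outcomes, group_size=4):
    """Count groups of consecutive pixels whose branch outcomes disagree.

    outcomes: one boolean per pixel, in the order the pixels are processed.
    """
    count = 0
    for start in range(0, len(outcomes), group_size):
        group = outcomes[start:start + group_size]
        if len(set(group)) > 1:    # both True and False occur in this group
            count += 1
    return count

# Hypothetical submission orders for the same eight pixels.
chaotic = [True, False, True, False, True, False, True, False]
grouped = sorted(chaotic)          # all False first, then all True
```

In the chaotic order every group is mixed, while the grouped order has no mixed group at all, so each group only ever runs one side of the branch.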
|
|
|
|
|
#22 | ||||||||
|
ea_spouse is H4WT!
Join Date: Feb 2002
Location: 53:4F:4E:59
Posts: 1,586
|
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
(note you can still emulate basic if/then with current (8.1) vertex and pixel shaders), Quote:
|
||||||||
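The note above about emulating a basic if/then without real branching usually means computing both sides and picking one arithmetically. The Python sketch below mirrors that idea: `sge` stands for a compare instruction that writes 0.0 or 1.0 (vertex shaders of that generation have `sge`/`slt`, pixel shaders have `cnd`/`cmp`), while `select` and `shade` and their constants are hypothetical illustrations rather than anyone's actual shader.

```python
def sge(x, y):
    """Set-on-greater-or-equal: 1.0 if x >= y, else 0.0 (models a compare op)."""
    return float(x >= y)

def select(mask, a, b):
    """Branch-free 'if mask then a else b' for a mask that is exactly 0.0 or 1.0."""
    return mask * a + (1.0 - mask) * b

def shade(n_dot_l):
    # Both sides are always computed; the mask only decides which one survives.
    lit = n_dot_l * 0.8 + 0.2      # hypothetical "then" side
    dark = 0.2                     # hypothetical "else" side
    return select(sge(n_dot_l, 0.0), lit, dark)
```

The price is that every pixel or vertex pays for both sides, so this only stays cheap while the two sides are short. That is the gap that true flow control, as debated in this thread, is meant to close.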
|
|
|
|
|
#23 |
|
Regular
|
It would be nice if you could efficiently implement something like this (http://graphics.lcs.mit.edu/~gs/research/dispmap/) in a pixel shader, which is IMO not possible without conditional branches.
|
|
|
|
|
|
#24 |
|
ea_spouse is H4WT!
Join Date: Feb 2002
Location: 53:4F:4E:59
Posts: 1,586
|
Is there some reason why you'd want to do this in a pixel shader (vs. say a vertex shader)?
It doesn't look too different from sampled displacement mapping, which is one of the possible methods described that DX9 supports, and I'm pretty sure it's very similar to what Parhelia does (and similar to the 9700 as well). |
|
|
|
|
|
#25 |
|
Regular
|
I don't know what exactly sampled displacement mapping is, but AFAICS with ATI's and Matrox's methods it's hard to even guarantee pixel-precise tessellation of the base surface (NVIDIA's programmable tessellator, if it exists, should potentially make that much easier), let alone of the final result. This method is pixel precise (well, not entirely, but close enough) without needing conservative (i.e. very high) tessellation factors or adaptive tessellation (much slower, and it could not be done in a single pass in a vertex shader of course ... it would require some complex multipass hacks storing intermediate results in textures; again though, it would probably be much easier with NVIDIA's programmable tessellator if it exists).
As for why pixel shaders, a very simple reason ... there are more of them. |
|
|
|