Flow control architecture - some questions

alexsok

First, a comment from one of the guys on 3DGPU forums:

Someone started an obscure thread somewhere and rather than respond there I thought I'd raise this one about how NV30 will be so different from anything before - even the Radeon 9700.

Flow control is part of the rumoured new architecture of NV30 under DirectX 9 or OpenGL 2.0.

Flow control for a GPU is simply if statements and looping constructs in vertex and pixel shader code. No big deal? Think again!! Done right, you might get a 30% performance boost, decrease your video RAM bottlenecks and decrease your dependence on a fast CPU to send you heaps of triangle data!!!

NVidia's top 3D guys (William R. Mark and Matt Papakipos) co-wrote a Stanford University paper about the future of 3D programming. This sparked all the "in the future, flow control will revolutionise 3D programming" stuff.

Basically, they showed that if you put conditional looping into pixel/vertex shaders you SIGNIFICANTLY lower the memory bandwidth needed between the graphics chip and graphics memory (i.e. how to put 20+ GB/sec of bandwidth to very, very good use) without incurring intolerable rises in GPU execution overhead. Given that GPUs are doubling in execution power three times faster than CPUs, this makes things look very attractive. So in the future games may never again be CPU limited - wasn't the Radeon 9700 under JDK only matching the GeForce 4 Ti4600 on a 2.5 GHz P4?

This sparked a lot of (probably valid) speculation that this is what NV30 is all about, as the hardware required to make it all tick effectively is about the level we hear rumoured that NV30 will deliver. They dropped a lot of hints that the hardware to do all this will be available within the next year or two (or a lot sooner).
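
To make the looping idea concrete, here's a rough Cg/HLSL-style sketch of what he seems to mean (all names made up, and I'm only guessing at what DX9-class hardware will actually accept): the geometry is sent and transformed once, with a loop over the lights running inside the vertex shader, instead of a separate pass (and a separate upload of the vertex data) per light.

float4x4 worldViewProj;     // constants set once per batch
float3   lightDir[8];       // up to 8 directional lights
float3   lightCol[8];
int      numLights;         // loop count set by the application

struct VS_OUT {
    float4 pos : POSITION;
    float4 col : COLOR0;
};

VS_OUT main(float4 pos : POSITION, float3 normal : NORMAL)
{
    VS_OUT o;
    o.pos = mul(worldViewProj, pos);

    float3 accum = 0;
    // Flow control: one loop in the shader instead of one pass per light.
    for (int i = 0; i < numLights; i++)
        accum += lightCol[i] * saturate(dot(normal, lightDir[i]));

    o.col = float4(accum, 1);
    return o;
}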

That paper he was talking about is available here:
http://graphics.stanford.edu/papers/rtongfx/

In previous threads, it was noted by some people (not all) that flow control is not a big deal...

This guy makes it sound like it is!

Any ideas?

Also, a couple of things were suggested as to what NV30 is likely to have:

1. "The one advantage that NV30 will have over R300 is the programmable triangle tessellation unit. It's a fantastic feature but unfortunately one that will not be taken advantage of by game developers in its product lifecycle. (Heck, games are just now starting to take advantage of shaders which have been around since NV20!) Still, it's nice to see this type of innovation because it paves the way for making this a standard feature in future designs"

2. "A virtually unlimited number of pixel shader instructions"

The reason I'm bringing it all up is that I'm interested in knowing how NV30 will stand up against R300.
 
One thing that I can see is that if you don't have flow control in hardware, you should still be able to achieve the same effect by using another pass for each branch (i.e. each "if" statement would be a new pass). Obviously, having the flow control in hardware instead might allow for a very significant performance improvement.
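
For example (an HLSL-style sketch, everything here hypothetical), a per-pixel "if" like the one below is a single pass if the hardware has pixel shader flow control; without it, each side of the branch becomes its own pass, with stencil or alpha test masking off the pixels that took the other path.

sampler2D baseTex;    // hypothetical samplers
sampler2D shinyTex;

float4 main(float2 uv : TEXCOORD0, float4 mask : COLOR0) : COLOR
{
    // With flow control in the pixel shader this is one pass.
    // Without it, the two branches become two passes.
    if (mask.a > 0.5)
        return tex2D(shinyTex, uv) * 2.0;
    else
        return tex2D(baseTex, uv);
}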

However, things are not all rosy when it comes to flow control. For one, any sort of branching can cause major stalls in any pipeline. One of the reasons 3D accelerators are so fast is that they are very predictable, allowing for very efficient caching. Flow control significantly reduces the predictability of 3D renderers, making smart caching much more challenging.

While it may be true that flow control is better than multiple passes, it is not better than doing everything in one pass.

One thing that does upset me about what some people have said about the R300 is the claim that it is designed to do everything in one pass. Since it has limited program sizes and no flow control in the pixel shader, there are certainly shaders that would take more than a single pass. Hopefully we'll have other hardware out soon that does not have these limitations.
 
Thx for the info m8! :D

One thing I can safely say is that flow control will definitely be in NV30, I know that for certain, don't ask me from who or how cause I won't tell :D
 
I know that vertex shader flow control will be in the NV30 (from the Cg specs). But will it be in the pixel shader?
 
Chalnoth said:
I know that vertex shader flow control will be in the NV30 (from the Cg specs). But will it be in the pixel shader?

I'm glad to tell u that flow control will be in PS as well as VS! :D

As I said, my source ROCKS!!!! :D
 
That paper is talking about raytracing. PS flow control only benefits that by reducing multipass. Lots of other features are discussed in that paper like PS writing stencil bits directly, early fragment kill, etc.

But flow control won't make any difference for most of the shaders being used in games because no one is going to waste their time trying to do raytracing in a game.
 
Besides just reducing the number of passes, flow control reduces the need for automatic multipass rendering in HLSLs, which I think we've already decided can have a serious impact on performance.
 

However, things are not all rosy when it comes to flow control. For one, any sort of branching can cause major stalls in any pipeline.


That's not true. The reason you normally have a major stall in a CPU pipeline is on a mispredicted branch. And this is only when you are speculatively executing instructions down the predicted path and are forced to flush them out of the pipeline and start fetching from the actual calculated path. The longer the pipeline, the longer the stall. This is why the BPU on a 4-stage CPU like the MPC7400 can be much simpler than, let's say, the P4's -- because the hit is negligible even when it happens. On modern-day GPUs it's even simpler because we're talking about only one operation being in the pipe at any given time. If there's a mispredict you might stall for a cycle at worst as you throw away your speculative fetch.

You have to be careful in generalizing things you know about CPU's and applying them to GPU's because at the moment they are still significantly different in many ways.

As far as the triangle tessellation unit goes (or the Programmable Primitive Processor - PPP, as some people call it), there is no question we will see this on the NV30. Again, this is a fantastic idea and one that will revolutionize how we deal with things like continuous level of detail on terrain meshes etc. The problem is that many of the games we're going to see in the next 3 years are going to be based on either the UT2k3 or the Doom3 engines and neither of them will take advantage of this feature. As I stated earlier, it'll help advance 3D in general and hopefully ATI also adds this to R400, but IMO it's not a feature that'll actually be utilized in this generation.
 
CMKRNL said:
That's not true. The reason you normally have a major stall in a CPU pipeline is on a mispredicted branch.

GPU's must attempt to put texture data in the caches before rendering. This is where most of the stalls would occur. Additionally, there may be problems with keeping all of the pipelines flush if the GPU has a hard time figuring out how many clock cycles it will take each pixel pipeline to finish a pixel.
 

GPU's must attempt to put texture data in the caches before rendering. This is where most of the stalls would occur.


Ok so you're talking about the pixel pipeline specifically then? I thought you were referring to the vertex pipeline since that's the one with flow control on the R300. Even in the pixel pipeline case, I'm not sure if this is entirely true. On traditional pixel shader programs the HW is constantly prefetching all of the textures that you reference. How intelligent will next-gen HW be? Will it look at the scope of the loop and only prefetch what is necessary? If so, then yes it will take a hit on a mispredict but we can safely say that for the most part we will see very high branch prediction rates regardless of whether it's dynamic or not.


Additionally, there may be problems with keeping all of the pipelines flush if the GPU has a hard time figuring out how many clock cycles it will take each pixel pipeline to finish a pixel.


See above. The HW has to enforce sync across all pipelines so perhaps they will always prefetch all textures referenced in the code. This gets rid of sync issues due to mispredict as well as the added HW complexity in analyzing the scope of the loop or speculatively evaluating both sides of the branch.
 
CMKRNL said:
Ok so you're talking about the pixel pipeline specifically then? I thought you were referring to the vertex pipeline since that's the one with flow control on the R300. Even in the pixel pipeline case, I'm not sure if this is entirely true. On traditional pixel shader programs the HW is constantly prefetching all of the textures that you reference. How intelligent will next-gen HW be? Will it look at the scope of the loop and only prefetch what is necessary? If so, then yes it will take a hit on a mispredict but we can safely say that for the most part we will see very high branch prediction rates regardless of whether it's dynamic or not.

Yes, I was talking about pixel pipeline branching...for the simple reason that we certainly know that vertex pipeline branching is coming in the R300 and NV30. Others here have argued that pixel pipeline branching wouldn't necessarily be a good thing because it would hurt performance.

As for filling texture cache, branching makes it so that there are more possible textures, and reading the wrong ones into cache will eat up performance. Yes, you could read all possible textures into the caches, but if some of those aren't used, you aren't making full use of the available performance. This approach also requires quite a bit more cache (particularly if you're talking about 128-bit source art, such as what you might want to use for bump maps). If you attempt to predict which textures are to be used, you are bound to cause stalls.
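
Something like this hypothetical shader is what I mean: which texture is actually needed isn't known until the per-pixel condition has been evaluated, so the hardware either prefetches all of them (wasting cache and bandwidth on the unused ones) or waits and risks a stall. (HLSL-style sketch, all names invented.)

sampler2D snowTex;     // hypothetical terrain layers
sampler2D rockTex;
sampler2D grassTex;

float4 main(float2 uv : TEXCOORD0, float height : TEXCOORD1) : COLOR
{
    // The texture touched depends on per-pixel data, so it can't be
    // known at prefetch time.
    if (height > 0.8)
        return tex2D(snowTex, uv);
    else if (height > 0.3)
        return tex2D(rockTex, uv);
    else
        return tex2D(grassTex, uv);
}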

I'm sure that vertex shaders will also run into problems related to branching, but I'm not completely certain what those might be. I suppose it would depend on the program.

If it were possible in the vertex shader for the calculation of one vertex to be dependent upon the outcome of the previous vertex, then branching could be hell on parallelism.

See above. The HW has to enforce sync across all pipelines so perhaps they will always prefetch all textures referenced in the code. This gets rid of sync issues due to mispredict as well as the added HW complexity in analyzing the scope of the loop or speculatively evaluating both sides of the branch.

Branches might also actually require a different number of textures.
 
GPUs have incredibly deep pipelines in comparison to CPUs and no GPUs have branch prediction at the moment. A pipeline stall on a GPU would kill performance. Of course, the solution is to start to put branch prediction logic into the GPU. Again, this significantly complicates the GPU for no compelling reason (most of the pixel shading algorithms don't need this), and we know from history that branch prediction only takes you so far. Intel's Itanium is an attempt to solve the superscalar problems at which branch prediction/speculative execution techniques have plateaued.

(right now, pixel shaders are little more than multitexture register combiner units, and you guys are talking about adding speculative execution and branch prediction! Get REAL!)

In essence, we'd come full circle, burdening GPUs with the performance-killing paradigms that CPUs have had to grapple with for years; meanwhile, IA-64 is attempting to shed this speculative-execution baggage by embedding the logic in the compiler and removing it from the instruction set architecture, which right now requires the CPU to do it on IA-32.

When someone comes up with an overwhelmingly compelling pixel shader that needs this stuff, you might convince me, but I've been doing RenderMan for years, and I haven't encountered anything that truly needed it. ATI's RenderMan conversions to real-time demos bear witness to this.

Keep the pixel pipelines lean and mean.
 
GPUs have incredibly deep pipelines in comparison to CPUs

This is interesting; one would think this would allow them to scale significantly higher than their current clock rates?

I'm sure it doesn't have anything to do with switching speed or interconnect parasitics in terms of limiting clock rate. Power might be a concern, since it rises faster than linearly when frequency (and the voltage needed to reach it) is scaled up. I dunno, it just doesn't make much sense why they're not faster instead of largely going wider; mind you, with graphics there is LOTS of room for performance gain in wide-issue MPUs, but a faster clock lets you hide some memory latency and can help in increasing effective bandwidth.
 
As far as I'm concerned, there's no compelling reason currently to optimize for branches in the pixel pipeline, but it is pretty certain that there will be situations where you'll want to use those branches.

I think that, in the future, we will have GPUs with instruction sets similar to IA-64 that would put the branch-prediction-type stuff in software.

Anyway, my belief is that sometime in the future, we will have GPUs capable of flow control in the pixel shader, and we will also have pixel shaders written that will work best if the flow control is done in hardware. The sooner we have some base functionality for flow control, the sooner we will be able to see these shaders (The same goes for unlimited-length programs...at least, reasonably unlimited-length...).
 
Yes, you could read all possible textures into the caches, but if some of those aren't used, you aren't making full use of the available performance.


Agreed, but the HW complexity of determining what to selectively prefetch might be too high. I suppose the compiler could calculate this ahead of time and encode this information in the ASIC-specific micro-op, but then we get into all of the other associated problems.


I'm sure that vertex shaders will also run into problems related to branching, but I'm not completely certain what those might be. I suppose it would depend on the program.


I don't see this as an issue since vertex data is contiguous. For argument's sake, assume each cacheline is 8 DWORDs long. This means that as long as 1 in 8 DWORDs fetched from a particular buffer is referenced in a branch, it was worthwhile to perform the fetch.


GPUs have incredibly deep pipelines in comparison to CPUs and no GPUs have branch prediction at the moment.


DemoCoder, I don't think that's true. Why do you think the pipelines are incredibly deep? I would think the exact opposite is true, as there is no reason for them to be deep. Aside from clock frequency, you don't need to have a lot of operations in flight as there is no out-of-order execution, and every operation is single-cycle execution, not just single-cycle throughput, i.e. a dependent instruction can execute on the very next cycle. This can be accomplished through data feed forwarding as well, but I really don't think GPUs are anywhere near this level of complexity today. The other thing to consider is your point about no GPUs having branch prediction today. The P10 and R300 both offer branching, yet no branch prediction unit. If the pipeline were indeed that deep, wouldn't this be a necessity?
 
Sigh,

You keep talking about vertex shaders, but it is clear from the context of these discussions from the beginning that we have been talking about the pixel pipeline and branching in that. We already know that DX9 contains branching in VS. The issue du jour is the performance implications of putting these operations in the pixel pipeline as well.

Of course 3D pipelines are deep. Do you really think that you can go from the triangle setup stage to having a pixel written to the framebuffer in a single clock cycle? In order to achieve pixel-per-clock speed, the 3D pipeline has to hide all kinds of latencies, texture fetches, cache operations, floating point ops that might take more than 1 cycle, AGP accesses, etc.

And if you want to discuss the full pipeline, as measured from the vertex unit all the way down to the framebuffer, it is way deep as compared to a CPU pipeline.

For reference, the GeForce3 has about 600-800 pipeline stages (if you count all 4 pipelines).
http://www.ce.chalmers.se/undergraduate/D/EDA425/lectures2002/gfxhw.pdf (page 8)

So we have incredibly deep pipelines, whose performance depends heavily on parallelization and predictable memory access patterns, and now you want to introduce branching into this pipeline, which will disrupt memory access patterns, and you think it will be as simple as slapping a branch prediction unit into it?
 
I agree with DemoCoder; pixel pipes are very deep and long, which is why optimisation guides still tell you to sort by state and avoid needless toggling between two pipeline states (each change might force you to flush the data out of the pipe).

Now we need to specify what we are talking about: constant branching or dynamic branching. Let me give an example of the difference. Constant branching in the vertex shader would allow you to enable a certain number of lights (a different loop execute count) and enable environment mapping or a default mapping depending on the state of a constant bool. These branches are static because you set them per vertex batch: you set the bool/int constants and then you execute the shader on a batch of vertices. Dynamic branching, what you guys seem to be talking about, is different in that the condition for the branch is determined by the vertex program itself; for example, based on a distance calculated in the vertex shader (and thus a possibly very different result depending on which vertex you process) you execute a different section of the program (say a simple light model far away, a detailed light model close by).
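
A rough sketch of the two cases (HLSL/Cg-style, all names invented):

float4x4 worldViewProj;
float3   eyePos;
float    farThreshold;
bool     useEnvMap;     // constant bool: STATIC branch, set once per vertex batch

struct VS_OUT { float4 pos : POSITION; float4 col : COLOR0; };

VS_OUT main(float4 pos : POSITION, float3 normal : NORMAL)
{
    VS_OUT o;
    o.pos = mul(worldViewProj, pos);

    float3 c = float3(1, 1, 1) * saturate(normal.y);   // stand-in base lighting

    // Static branch: the whole batch either env-maps or it doesn't,
    // because the condition is a constant you set before drawing.
    if (useEnvMap)
        c *= 0.5;                                      // stand-in for the env-map path

    // Dynamic branch: the condition is computed per vertex inside the
    // shader, so vertices in the same batch can take different paths.
    float dist = length(eyePos - pos.xyz);
    if (dist > farThreshold)
        c = float3(c.x, c.x, c.x);                     // cheap far-away light model
    // else keep the detailed close-up result

    o.col = float4(c, 1);
    return o;
}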

Anyway, static branching is fairly easy and does not give a lot of pipeline stalls or problems. The second, however, can be a pain because you have a SIMD structure where suddenly the 4 elements processed together are no longer guaranteed to follow the exact same path... The same is true for pixel pipes: they work on 4 or 8 pixels at the same time, and because of the structure of the pipes you want to execute the exact same operations on all of them... now with completely dynamic conditional branching you get the situation where all 4/8 pipes might have to execute a different branch... SIMD architectures love that... NOT!

CPUs do not use SIMD instructions when they know that they do not have elements that are of a SIMD nature. So all your CPU talk is of little use when you are stuck in a high performance SIMD structure which is always SIMD and nothing but SIMD.

IF we see this kind of branching in the vertex and/or pixel shader it will potentially be quite slow, so developers will have to take great care about how they use it... remember GPUs are highly streamlined beasts built to do graphics; let's not make them "so" flexible that we end up with something as slow as a CPU doing graphics... if this continues, throw out your Intel/AMD and put in an ATI/NVIDIA/... (C)GPU.

G~
 
A couple of things according to this guy:

Perhaps it's more precise to say flow control is a way to avoid a few tough problems limiting today's 3D graphics, namely how to:

1) decrease the bottleneck on graphics bandwidth for current and future processors
2) decrease reliance on an ultra fast CPU for all games
3) implement a system that makes true world light (i.e. Doom 3 focus area) simpler in most cases
4) reduce coding by introducing loops where sensible to process a lot of data the same way.

Those are more than worthwhile objectives. It is not just speed; it is detail and realism too. There is nothing trivial about making graphics a lot more lifelike.

So basically NVidia have said the current architecture everyone has deployed to date has gone about as far as it can go easily; time for a more powerful approach based on more general, powerful GPUs. Researchers and senior staff at NVidia have used powerful simulators to model several ways forward (multi-branching algorithms and multi-path hardware, unfolding loops versus iterative looping-based code). They showed looping code avoids many natural limits whilst having more acceptable constraints. Given it's DirectX 9 / OpenGL 2.0 based, it is fairly important. To do a paradigm shift you have to have a sensible approach across all components - not just one area of 3D hardware or OS interfacing routines.
 

We already know that DX9 contains branching in VS. The issue du jour is the performance implications of putting these operations in the pixel pipeline as well.


Yes, we know that, but what are its performance implications? Do you already have access to DX9 HW? Many of the issues related to VS branching also apply to PS branching.


Of course 3D pipelines are deep. Do you really think that you can go from the triangle setup stage to having a pixel written to the framebuffer in a single clock cycle? In order to achieve pixel-per-clock speed, the 3D pipeline has to hide all kinds of latencies, texture fetches, cache operations, floating point ops that might take more than 1 cycle, AGP accesses, etc.


I think it's patently obvious to even the most incredibly obtuse that the ENTIRE pipeline is not a single stage. This was never in question. What we're talking about here SPECIFICALLY is the execution portion of the pixel shader pipeline. If you mispredict on a pixel shader instruction, you don't throw away everything starting from your triangle setup! Why do you think that paper references a 20-stage P4 pipeline? Because the other 8 stages outside of the trace cache are used for x86 decode and are not relevant to the discussion.

Are we all on the same page now? We're talking about the RELEVANT portion of the pipeline, not the entire thing from start to finish.

On modern GPUs, each shader operation has single-cycle throughput. The question is whether it also has single-cycle execution. If the execution unit were multi-stage, you could potentially have single-cycle throughput but not single-cycle execution. Floating-point pipelines on most modern CPUs are multi-stage. This is why, unless you have out-of-order execution or the compiler schedules it accordingly, you are going to run into stalls on dependent operations. The pixel shader, on the other hand, allows you to use the results from the previous operation on the very next cycle. This implies that the execution unit is 1 stage deep.
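
A trivial illustration of what I mean by a dependent chain (HLSL-ish sketch, hypothetical names): each result feeds straight into the next instruction, which only avoids stalls if the result is available on the following cycle, i.e. single-cycle execution or full forwarding.

sampler2D baseTex;
float4 tint;
float4 ambient;

float4 main(float2 uv : TEXCOORD0) : COLOR
{
    float4 t = tex2D(baseTex, uv);
    float4 r = t * tint;     // result of this mul...
    r = r + ambient;         // ...is consumed by the very next op
    r = r * r;               // and again: a fully dependent chain
    return r;
}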

As Chalnoth and Guest have pointed out, there are two main classes of issues to deal with:

1) Possibly unnecessary texture/data fetches due to conditional branches
2) Pipeline synchronization, due to each pipeline potentially evaluating branches differently and taking a different number of cycles to complete.

Both of these issues are present (to different degrees) in both vertex and pixel shaders. How does modern DX9 HW address these in the vertex pipeline? The answer to this will give you a clue as to how they plan on supporting this in the pixel pipeline. I'm positive that this is the direction we're moving in and if we don't see it on NV30, you'll definitely see it in the generation after that.
 