Flow control architecture - some questions

I still think it would be better to have branching capability than to forgo it because of performance issues.

Future hardware architectures can deal with figuring out how to optimize performance with branching. After all, since 3D is still much more predictable than CPU code, it should still be easier to do the proper branch prediction. What may prove impossible, however, is keeping all pixel pipelines full as often as they are today.

But game developers need not use branches. Just because they exist in hardware doesn't mean they have to reduce performance. Smart developers will only use them in limited situations, and the hardware designers may release white papers on "how to optimally use flow control" in order to keep the pipelines moving.

Some examples might be:

1. Avoid sending pixels whose branches flow in a chaotic order (i.e. send lots of pixels that follow the same branch at once).
2. Avoid an excessive amount of branching.
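The batching idea in point 1 can be sketched in Python. This is a hypothetical driver-side sort for illustration only, not any real API: split the incoming pixels by branch direction so each batch submitted to the SIMD pipelines follows the same path.

```python
# Hypothetical driver-side batching (illustration only): split pixels by
# branch direction so each submitted batch follows the same path and the
# parallel pixel pipelines stay full.

def batch_by_branch(pixels, takes_branch):
    """Split a pixel list into two coherent batches by branch outcome."""
    taken = [p for p in pixels if takes_branch(p)]
    not_taken = [p for p in pixels if not takes_branch(p)]
    return taken, not_taken

# Example: branch on whether a pixel is lit
pixels = [{"id": i, "lit": i % 3 != 0} for i in range(9)]
lit, shadowed = batch_by_branch(pixels, lambda p: p["lit"])
print(len(lit), len(shadowed))  # 6 3
```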

Anyway, to me, branching is just a way to avoid the need for auto-multipass, which we know could slow down the pipelines even more (except, perhaps in scene graphs).
 
The P10 and R300 both offer branching, yet neither has a branch prediction unit. If the pipeline were indeed that deep, wouldn't one be a necessity?

Hey you left out Parhelia! ;)

I think it's patently obvious to even the most incredibly obtuse that the ENTIRE pipeline is not a single stage. This was never in question. What we're talking about here SPECIFICALLY is the execution portion of the pixel shader pipeline. If you mispredict on a pixel shader instruction you don't throw away everything starting from your triangle setup! Why do you think that paper references a 20 stage P4 pipeline? Because the other 8 stages outside of the trace cache are used for x86 decode and are not relevant to the discussion.

Well, a stall likely wouldn't cause you to flush everything back to triangle setup; it would likely cause execution further back in the pipeline to stall until the mispredict is resolved...

On modern GPUs, each shader operation has single-cycle throughput.

That really only applies to the VS (and even there, not in all cases).

The question is whether it's also single-cycle execution. If the execution unit were multi-stage, you could have single-cycle throughput but not single-cycle execution.
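The distinction can be sketched with a toy in-order model in Python. It is not any real GPU: one op issues per cycle (single-cycle throughput), but an op that reads a result must wait out its source's full execution latency.

```python
# Toy in-order model (not any real GPU): one op issues per cycle
# (single-cycle throughput), but an op that reads a register must wait
# until that register's full execution latency has elapsed.

def schedule(ops, latency):
    """ops: list of (dest, srcs). Returns cycles until the last result."""
    ready = {}  # register -> cycle its value becomes available
    cycle = 0
    for dest, srcs in ops:
        start = max([cycle] + [ready.get(s, 0) for s in srcs])  # stall
        ready[dest] = start + latency
        cycle = start + 1  # next issue slot
    return max(ready.values())

# A dependent chain: r1 = op(r0); r2 = op(r1); r3 = op(r2)
chain = [("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r2"])]
print(schedule(chain, latency=1))  # 3  (1-stage unit: no stalls)
print(schedule(chain, latency=4))  # 12 (4-stage unit: stalls dominate)
```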

Actually, for the most part VS ops are single-cycle execution (with a few exceptions). The trick is to get single-cycle throughput.

Floating-point pipelines on most modern CPUs are multi-stage. This is why, unless you have out-of-order execution or the compiler schedules accordingly, you are going to run into stalls on dependent operations.

Exactly, or in the case of a VS, it runs multiple threads (think SMT) to schedule ops to hide execution latencies.
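That latency-hiding trick can be modeled with a toy round-robin scheduler in Python. This is a hypothetical illustration, nothing like real hardware timing: with one thread, a deep execution unit leaves bubbles between dependent ops; with enough independent threads, every cycle issues something.

```python
# Hypothetical round-robin scheduler (illustration, not real HW timing):
# several independent chains (think vertices) share one execution unit
# whose results take `latency` cycles before a dependent op may issue.

def cycles_for_chains(chain_len, latency, threads):
    next_issue = [0] * threads  # earliest cycle each thread may issue
    done = [0] * threads        # ops completed per thread
    cycle = 0
    while min(done) < chain_len:
        for t in range(threads):
            if done[t] < chain_len and next_issue[t] <= cycle:
                next_issue[t] = cycle + latency  # dependent op must wait
                done[t] += 1
                cycle += 1
                break
        else:
            cycle += 1  # no thread ready to issue: a pipeline bubble
    return cycle

print(cycles_for_chains(3, 4, 1))  # 9  cycles for 3 ops (many bubbles)
print(cycles_for_chains(3, 4, 4))  # 12 cycles for 12 ops (no bubbles)
```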

The pixel shader on the other hand allows you to use the results from the previous operation on the very next cycle. This implies that the execution unit is 1 stage deep.

Current pixel shaders are a mess when it comes to this, and are definitely not single stage. With DX9, at least, you're migrating from CISC-like color ops to more general RISC-like vector math (which streamlines the instruction set), making it more orthogonal to vertex shaders (and performance profiling becomes easier, since you're back to counting cycles like vertex shaders).

Both of these issues are present (to different degrees) in both vertex and pixel shaders. How does modern DX9 HW address these in the vertex pipeline?

Well, there are only 4 new instructions that I'm aware of regarding flow control, and they all evaluate constant registers, so you really don't have "mis-prediction" per se... There are still limitations on what you can do (no nested subroutines, no jumping in and out of loops)...

(Note that you can still emulate a basic if/then with current (8.1) vertex and pixel shaders.)
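The usual branch-free select idiom behind that emulation (an sge-style compare builds a 0/1 mask, then arithmetic blends the two results) can be sketched in Python rather than shader assembly:

```python
# Branch-free if/then in the DX8.1 style (Python stand-in for shader
# asm): a compare builds a 0/1 mask, then arithmetic blends the results.

def select(x, a, b):
    """Computes 'if (x >= 0) a else b' with no branch in the shader."""
    mask = 1.0 if x >= 0.0 else 0.0      # models: sge mask, x, zero
    return mask * a + (1.0 - mask) * b   # models: mul/mad-style blend

print(select(0.5, 10.0, 20.0))   # 10.0
print(select(-0.5, 10.0, 20.0))  # 20.0
```

Both sides are always computed, which is exactly why real conditional branches become interesting once one side is expensive.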

Future hardware architectures can deal with figuring out how to optimize performance with branching. After all, since 3D is still much more predictable than CPU code, it should still be easier to do the proper branch prediction. What may prove impossible, however, is keeping all pixel pipelines full as often as they are today.

Graphics in general are pretty predicated, so you really don't need branch prediction per se. At most just conditional selects would be more than adequate for most cases, especially if you're following the DirectX route, and evaluating to registers in hardware.
 
It would be nice if you could efficiently implement something like <A HREF=http://graphics.lcs.mit.edu/~gs/research/dispmap/>this</A> in a pixel shader, which is IMO not possible without conditional branches.
 
Is there some reason why you'd want to do this in a pixel shader (vs., say, a vertex shader)?

It doesn't look too different from sampled displacement mapping, which is one of the possible methods DX9 supports. I'm pretty sure it's very similar to what Parhelia does (and similar to the 9700 as well).
 
I don't know what exactly sampled displacement mapping is, but AFAICS with ATI's and Matrox's methods it's hard to even guarantee pixel-precise tessellation of the base surface (NVIDIA's programmable tessellator, if it exists, should make that much easier), let alone the final result. This method is pixel precise (well, not entirely, but close enough) without needing conservative tessellation factors (i.e. very high ones) or adaptive tessellation (much slower, and of course it could not be done in a single pass in a vertex shader ... it would require some complex multipass hacks storing intermediate results in textures; again, though, that would probably be much easier with NVIDIA's programmable tessellator, if it exists).

As for why pixel shaders, a very simple reason ... there are more of them.
 
DemoCoder said:
(right now, pixel shaders are little more than multitexture register combiner units, and you guys are talking about adding speculative execution and branch prediction! Get REAL!)

Exactly what I was thinking while reading this thread... :)

I don't think we will see advanced branch prediction logic on GPUs anytime soon; the transistors are massively needed elsewhere. ;)
 
You don't need branch prediction; you don't even really need speculative execution. You can just stall on each branch, or use delay slots.
 
MfA said:
You don't need branch prediction; you don't even really need speculative execution. You can just stall on each branch, or use delay slots.
À la MIPS microprocessors, it will work ;)
 
MfA said:
You don't need branch prediction; you don't even really need speculative execution. You can just stall on each branch, or use delay slots.

Well, I've forgotten most of my processor design lectures (and I could once draw a MIPS from scratch! ;-) ), but wouldn't the stall cause a major performance impact, thus rendering the performance benefits of flow control useless?
 
I just mentioned stalling because it's the simplest method; delay slots are of course preferred (SH-5 style split branches with speculative execution but without dynamic branch prediction are also nice, but might already be too complex). Even with stalls, there are some things you just can't realistically do without conditional branches, for which they would still be useful. Say we have a loop implementing an iterative search which we would have to repeat 100 times in the worst case but only 2 times on average with conditional branches ... without conditional branches, you would have to loop it 100 times every time to get the same result.
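A quick Python sketch of that trade-off, using a made-up stand-in for the search (a Newton iteration for sqrt(2); the function, starting point, and tolerance are purely illustrative):

```python
# Made-up example: a Newton iteration for sqrt(2) as the "search".
# With a conditional branch we exit as soon as it converges; without
# one we must run the worst-case count and freeze the result by select.

def search_with_branch(f, x=1.0, steps=100, eps=1e-6):
    for n in range(1, steps + 1):
        x = f(x)
        if abs(f(x) - x) < eps:  # converged: conditional branch out
            return x, n
    return x, steps

def search_without_branch(f, x=1.0, steps=100, eps=1e-6):
    done = 0.0
    for _ in range(steps):  # always runs all `steps` iterations
        nxt = f(x)
        x = done * x + (1.0 - done) * nxt   # select: freeze x once done
        if abs(f(x) - x) < eps:             # a shader would use sge here
            done = 1.0
    return x

f = lambda x: 0.5 * (x + 2.0 / x)  # Newton step for sqrt(2)
x, n = search_with_branch(f)       # converges in a handful of steps
```

The branching version does a few iterations of work; the branch-free one always pays for all 100.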
 
MfA said:
I don't know what exactly sampled displacement mapping is, but AFAICS with ATI's and Matrox's methods it's hard to even guarantee pixel-precise tessellation of the base surface
Marco,
I'm not sure it's really necessary to go all the way to pixel level with the displacement map tessellation. I would think that if you stopped with polys that were, say, 10x10, but also applied bump mapping, then the final result would look more than adequate. AFAICS, the only real advantages of displacement over bump mapping are that the silhouette edges are rough and (in the rare times that it happens) the interpenetration of objects would be correctly modelled. In both cases, I don't think these would need to be modelled at pixel resolution to look convincing.
 