ATi presentation on future GPU design

DaveBaumann said:
We're talking about the VS here. The fragment shaders issue either one full vector op or one <4-component vector op plus a scalar, whereas the vertex shader can cope with one full vector op and a scalar simultaneously (in R300).

Remembered that about 3.5 seconds after I hit submit the first time. :oops:
 
arjan de lumens said:
Evildeus said:
Is there a processor/GPU that does it fast almost always? :?:
The massively multithreaded Cray MTA processor. When it detects a branch (or any instruction that might have a >1 cycle latency, such as a memory load) it swaps execution to another thread; by juggling around 100+ threads, it can easily sustain >98% of maximum theoretical IPC even on code heavily loaded with branches or memory loads, despite the fact that it has no branch predictor or data cache.

That particular method would seem to be a poor fit for rendering, as it significantly sacrifices single-threaded performance for more parallelism. As long as there are a hundred threads, the processor is fully utilized, but I'm sure most graphics developers would be hard pressed to divide a teapot render into a hundred threads.

Any one task would take much longer to execute; for a supercomputer, the odds are that this is hidden by the massive number of tasks, and most supercomputer jobs are given hours, if not days or weeks, to complete.

The average fps gamer would be complaining of something far worse than lag if that were the case. ;)

Can anyone tell me what ratio of ordinary ops to branches future shaders are likely to have? I recall x86 code has a very significant percentage of branches, which necessitated heavy branch prediction to make up for increasingly deep pipelines.

It may only be a matter of time before we start seeing similar mechanisms for branch avoidance/prediction in upcoming video chips if branching becomes as significant as it is in other kinds of code, and if the pipelines are in fact very long.
 
3dilettante said:
arjan de lumens said:
Evildeus said:
Is there a processor/GPU that does it fast almost always? :?:
The massively multithreaded Cray MTA processor. When it detects a branch (or any instruction that might have a >1 cycle latency, such as a memory load) it swaps execution to another thread; by juggling around 100+ threads, it can easily sustain >98% of maximum theoretical IPC even on code heavily loaded with branches or memory loads, despite the fact that it has no branch predictor or data cache.

That particular method would seem to be a poor fit for rendering, as it significantly sacrifices single-threaded performance for more parallelism. As long as there are a hundred threads, the processor is fully utilized, but I'm sure most graphics developers would be hard pressed to divide a teapot render into a hundred threads.
A poor fit for software rendering, perhaps, but having a hardware renderer fork off a hundred threads in order to process a hundred vertices or a hundred pixels doesn't sound that hard to me, as long as the vertices/pixels aren't dependent on each other. IIRC, the vertex shader in NV20 is already 6-way multithreaded to mask the 6-cycle latency of most vertex shader instructions, and the pixel shader in NV30 juggles around ~170 execution threads, corresponding to 170 pipeline steps (with >100 steps set aside for texturing alone; yes, texturing has MUCH higher latencies than piddly branches). With such long instruction latencies, especially for texturing, NOT doing massive multithreading will hurt your performance so badly it isn't even funny.
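
To make that concrete, here is a toy model (an illustrative C++ sketch only, not how any actual chip is wired): a scheduler holds many in-flight fragment "threads" and simply issues from whichever one is ready, so a long texture latency only costs throughput when too few threads are resident.

```cpp
#include <cstdint>
#include <vector>

// Toy latency-hiding model: one "thread" per in-flight fragment.
struct Thread {
    int      pc       = 0;  // next instruction for this fragment
    uint64_t ready_at = 0;  // cycle at which the thread may issue again
};

// Run the machine for `cycles` cycles; return how many issue slots were wasted.
uint64_t run(std::vector<Thread>& threads, uint64_t cycles, uint64_t tex_latency) {
    uint64_t bubbles = 0;
    for (uint64_t now = 0; now < cycles; ++now) {
        bool issued = false;
        for (Thread& t : threads) {        // pick any thread that is ready this cycle
            if (t.ready_at <= now) {
                ++t.pc;                    // "execute" one instruction
                // Pretend every 4th instruction is a texture fetch with long latency.
                t.ready_at = now + ((t.pc % 4 == 0) ? tex_latency : 1);
                issued = true;
                break;
            }
        }
        if (!issued) ++bubbles;            // nobody ready: the pipe stalls
    }
    return bubbles;
}
```

With a handful of threads and a 100-cycle texture latency this spends most of its time stalled; with a couple of hundred threads the bubble count drops to roughly zero, which is exactly the effect the long-pipeline arrangement above is after.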
 
Can anyone enlighten me why GPUs have 100+ pipeline stages while CPUs only have dozens? Is it because GPUs are heavily multi-threaded?
 
991060 said:
Can anyone enlighten me why GPUs have 100+ pipeline stages while CPUs only have dozens? Is it because GPUs are heavily multi-threaded?
Two main reasons:
1. Branches. CPUs have to handle branching on a regular basis, and the shorter pipelines help to keep from having to throw out too much data when a branch prediction goes bad.
2. Latency hiding. By having hundreds of pipeline stages, GPUs can hide the large latency of waiting for texture data, so that GPUs are never left waiting for data when memory bandwidth is still available. Well, that's the goal, at least.
 
Chalnoth said:
Two main reasons:
1. Branches. CPUs have to handle branching on a regular basis, and the shorter pipelines help to keep from having to throw out too much data when a branch prediction goes bad.
2. Latency hiding. By having hundreds of pipeline stages, GPUs can hide the large latency of waiting for texture data, so that GPUs are never left waiting for data when memory bandwidth is still available. Well, that's the goal, at least.
OK, thanks. Can I say that when branching is widely used and there's a reasonable amount of cache embedded in the GPU, the number of pipeline stages is going to come down?
 
For those who would like a simplified explanation of why pipeline stages exist and why they are sometimes made longer, read ahead:

Many of the operations that take place within the GPU, particularly pixel ops, are high latency (meaning that they require a relatively high number of clock cycles to complete) and take multiple steps to execute. A dot product, for example, requires both multiplication and addition. If a multiply has a 1-cycle latency (requires 1 cycle to complete) and an add the same, the operation can be subdivided into a multiply stage and an add stage, so that the first stage works on one input and then hands it to the second; simultaneously, the first stage can receive another input and operate on that one. This is why a longer pipeline usually means operations are subdivided into many stages, which masks (hides) the latency of one macro-op that may consist of several micro-ops. If a dot-product unit like the one above were not pipelined into multiply and add stages, and each step took 1 cycle, the unit could only output a result every 2 clock cycles; pipelined, it can output one result per cycle once the pipeline is full.
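
A quick back-of-the-envelope version of that last point, using the made-up 1-cycle stage latencies from the example:

```cpp
// Hypothetical dot-product unit split into a multiply stage and an add stage,
// each with a 1-cycle latency.
int unpipelined_cycles(int n) { return 2 * n; }        // one result every 2 cycles
int pipelined_cycles(int n)   { return 2 + (n - 1); }  // 2-cycle fill, then 1/cycle

// For 1000 inputs: 2000 cycles unpipelined vs. 1001 pipelined -- roughly double
// the throughput, even though the latency of any single result is still 2 cycles.
```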
 
Thanks, I think I know the basic concept behind pipelining, I just didn't realize tex ops have that much latency at first.
 
991060 said:
OK, thanks. Can I say that when branching is widely used and there's a reasonable amount of cache embedded in the GPU, the number of pipeline stages is going to come down?
I don't expect branching to ever be used widely in GPUs. I would rather expect that IHVs will just have branching run at a lower speed than risk reducing performance under more normal rendering conditions.
 
I would expect GPU branching to be slow mainly in the case where different pixels in a pixel quad fork off in different directions, so that you suddenly need to track 4 instruction pointers for the quad instead of just 1 and therefore need to fetch 4 times as many instructions. Also, in the bundle of instructions executed for a given clock cycle, the branch would probably need to be the last instruction, so you may get wasted instruction slots.

Other than that, there is no reason why you can't use existing multithreading in the GPU to eat the branch latency and thus get cheap branches.
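
As a toy illustration of why the divergent case hurts (purely a hypothetical cost model, not any shipping design):

```cpp
#include <array>

// One 2x2 pixel quad sharing an instruction pointer.
struct Quad {
    std::array<bool, 4> takes_branch;  // per-pixel branch outcome
};

bool branch_is_uniform(const Quad& q) {
    return q.takes_branch[0] == q.takes_branch[1] &&
           q.takes_branch[1] == q.takes_branch[2] &&
           q.takes_branch[2] == q.takes_branch[3];
}

// If all four pixels agree, one instruction stream suffices; if they diverge,
// both sides get fetched and executed, with the wrong side masked out.
int instructions_executed(const Quad& q, int then_len, int else_len) {
    if (branch_is_uniform(q))
        return q.takes_branch[0] ? then_len : else_len;
    return then_len + else_len;
}
```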
 
It seems it's the available resources, rather than the latency, that are the main problem with using branching; sireric from ATi just mentioned it in this thread: http://www.beyond3d.com/forum/viewtopic.php?t=9985

I find the two sides of the problem contradictory: a longer shader favors branching because of the potential saving in instruction count, but it also consumes more resources, which makes starting a new thread even harder.
 
arjan de lumens> How do you see branching being implemented in GPUs? What are, in your opinion, the advantages and drawbacks?

PS: Thanks for the time you take to explain :)
 
arjan de lumens said:
The massively multithreaded Cray MTA processor. When it detects a branch (or any instruction that might have a >1 cycle latency, such as a memory load) it swaps execution to another thread; by juggling around 100+ threads, it can easily sustain >98% of maximum theoretical IPC even on code heavily loaded with branches or memory loads, despite the fact that it has no branch predictor or data cache.

The MTA doesn't detect branches. The architecture is that of a barrel processor, where each cycle an instruction from a different thread is executed. If there aren't enough threads available then a "null thread" is inserted. The threading on the MTA is not in any way dynamic.

Aaron Spink
speaking for myself inc.
 
arjan de lumens said:
A poor fit for software rendering, perhaps, but having a hardware renderer fork off a hundred threads in order to process a hundred vertices or a hundred pixels doesn't sound that hard to me, as long as the vertices/pixels aren't dependent on each other. IIRC, the vertex shader in NV20 is already 6-way multithreaded to mask the 6-cycle latency of most vertex shader instructions, and the pixel shader in NV30 juggles around ~170 execution threads, corresponding to 170 pipeline steps (with >100 steps set aside for texturing alone; yes, texturing has MUCH higher latencies than piddly branches). With such long instruction latencies, especially for texturing, NOT doing massive multithreading will hurt your performance so badly it isn't even funny.

No current GPU juggles threads. Period.

The closest CPU architecture to a GPU is a stream processor or a vector processor. The architecture isn't "run these x number of threads". The architecture is closer to "run this instruction sequence on these x data elements". There is a big architecture and micro-architecture difference between the two. In a stream processor, for example, when you want to run a different instruction sequence, the pipeline takes a big performance hit as the last instruction sequence is drained and the new one is loaded.

Same with a vector architecture. In a vector architecture you run the same instruction sequence over a large number of data elements. The performance advantage of both stream and vector architectures is in no small part due to executing the same instruction stream over a large number of data elements, so that the latencies are hidden by the continual pipelining of the data elements.

To support true branching in a GPU, the GPU would effectively have to be made from a large number of general-purpose CPUs (for example, some number of ARMs), with each CPU executing multiple threads. The overhead in design, implementation, and silicon is significant. If GPUs were designed this way, then the majority of their advantage over standard CPUs would disappear.
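
A rough way to picture the stream model being described (purely illustrative, not any particular chip): the kernel is fixed for the whole stream, so the pipe settles into a steady rhythm, and the expensive event is changing kernels, not executing them.

```cpp
#include <vector>

struct Element { float x, y, z, w; };

// The "kernel": one fixed instruction sequence applied to every element.
Element kernel(const Element& in) {
    return { in.x * 2.0f, in.y * 2.0f, in.z * 2.0f, in.w };  // stand-in shader
}

void run_stream(const std::vector<Element>& in, std::vector<Element>& out) {
    out.clear();
    out.reserve(in.size());
    for (const Element& e : in)
        out.push_back(kernel(e));  // no per-element control-flow decisions
}
// Swapping in a *different* kernel is where the cost lives: the pipeline drains,
// the new instruction sequence is loaded, and only then does the steady
// one-result-per-clock rhythm resume.
```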


Aaron Spink
speaking for myself inc.
 
arjan de lumens said:
I would expect GPU branching to be slow mainly in the case where different pixels in a pixel quad fork off in different directions, so that you suddenly need to track 4 instruction pointers for the quad instead of just 1 and therefore need to fetch 4 times as many instructions. Also, in the bundle of instructions executed for a given clock cycle, the branch would probably need to be the last instruction, so you may get wasted instruction slots.

Other than that, there is no reason why you can't use existing multithreading in the GPU to eat the branch latency and thus get cheap branches.

The only way to really do branching in current GPU architectures is via dead instructions. You execute all the instructions in a shader, except that the output of the non-taken instructions is rejected. Early-out is only really beneficial when it applies to all the pixels within a group. If even one pixel doesn't take the early-out, then you'll most likely end up executing the whole shader for all the pixels.
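
In source terms, that "dead instruction" scheme looks roughly like compiling an if/else into straight-line code plus a select (a hedged sketch, not any specific compiler's output):

```cpp
// Branchy source, per pixel:
//   if (cond) color = expensive_a(); else color = expensive_b();
//
// Predicated form actually executed: both sides run for every pixel, and the
// "wrong" result is simply thrown away by the select at the end.
float shade(bool cond, float a_in, float b_in) {
    float a = a_in * a_in + 1.0f;   // stand-in for expensive_a()
    float b = b_in * 0.5f - 2.0f;   // stand-in for expensive_b()
    return cond ? a : b;            // select -- no change of control flow
}
// An early-out only pays off if every pixel in the group rejects the same side;
// one pixel that needs path A keeps path A's instructions live for all of them.
```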

Aaron Spink
speaking for myself inc.
 
this might be a stupid idea, but how about this:

instead of running a program P, comprised of instructions for a single sample, on a 2x2 sample stamp:

1.) use only 3 samples per stamp (you don't need more for ddx/ddy)

2.) the programmer writes P, the compiler generates a program P' consisting of instructions for all 3 stamp samples, meaning that you have:
- 1 instruction pointer / loop register / condition register (for all 3 samples)
- taken branches are executed 1 sample at a time.
- unified register space for all 3 samples

Pros:
- for common code segments, the optimizer can interleave instructions from all 3 threads.
- besides ILP loss due to reduced scope for instruction reordering, there is no penalty for samples taking divergent branches.

Cons:
- up to 3x the number of instructions
- latency per sample goes up
- texture fetch issue becomes burstier?
- probably more to add here
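
To make this concrete, a hedged sketch of what P' might look like for a toy P (made-up instructions, and assuming the unified register space and single instruction pointer described above):

```cpp
// Toy per-sample program P (conceptually):
//   r1 = r0 * 2;  if (r1 > 0) r2 = r1 + 1; else r2 = r1 - 1;
//
// Compiler output P' for a 3-sample stamp: one instruction pointer and a
// unified register space r[sample][n]; common code is interleaved across the
// three samples, and a divergent branch degrades to one sample at a time.
struct StampRegs { float r[3][8]; };

void p_prime(StampRegs& regs) {
    // Common section: the optimizer interleaves the three samples' copies.
    for (int s = 0; s < 3; ++s)
        regs.r[s][1] = regs.r[s][0] * 2.0f;

    // Divergent branch: executed per sample, so divergence costs extra
    // instructions but never an extra instruction pointer.
    for (int s = 0; s < 3; ++s) {
        if (regs.r[s][1] > 0.0f) regs.r[s][2] = regs.r[s][1] + 1.0f;
        else                     regs.r[s][2] = regs.r[s][1] - 1.0f;
    }
}
```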

What do you think?
Serge
 
What's in a name ...

NVIDIA has said this about their vertex shader:

The floating-point core is a multi-threaded vector processor operating on quad-float data. Vertex data is read from the input buffers and transformed into the output buffers (OB). The latency of the vector and special function units are equal and multiple vertex threads are used to hide this latency.

Execution units for pixel shaders have more than 1 context and switch between them to overcome latency. Sounds like SMT to me.

GPUs have aspects of vector and stream processors, and are multithreaded. These naming schemes are not mutually exclusive (you can do stream processing with or without multithreading; Stanford does without, for instance).
 
MfA said:
What's in a name ...

NVIDIA has said this about their vertex shader:

The floating-point core is a multi-threaded vector processor operating on quad-float data. Vertex data is read from the input buffers and transformed into the output buffers (OB). The latency of the vector and special function units are equal and multiple vertex threads are used to hide this latency.

Which is why in a lot of cases CPUs can outperform a vertex shader...

Execution units for pixel shaders have more than 1 context and switch between them to overcome latency. Sounds like SMT to me.

Which is of course why, if you execute state changes with anything approaching frequency, the performance will take a dive to useless????

Pixel shaders DO NOT have more than one context. They are NOTHING like SMT at either the architecture or micro-architecture level.



GPUs have aspects of vector and stream processors, and are multithreaded. These naming schemes are not mutually exclusive (you can do stream processing with or without multithreading; Stanford does without, for instance).

The Pixel shaders are NOT multithreaded. Their performance characteristics prove this beyond a shadow of a doubt. The pixel shaders in all modern GPUs resemble either a Vector processor with SIMD operations or a Stream processor with SIMD operations.

While some people like to lump VLIW/SMT/CMT/Vector/Stream into one category, that is incorrect. The performance and design characteristics of a Stream processor and an SMT processor are worlds apart, most apparently in their ability to change flow.

And while Stream processors could perhaps be multi-threaded, they would lose a large amount of their efficiency and performance. The best way to think of a Stream processor is as an application specific FPGA datapath: shove data in and it comes out with complex operations performed on it. Just don't try to be dynamic in what those complex operations are and the performance will be good.


Aaron Spink
speaking for myself inc.
 
If you have a stream processor which can work on "elements" in the stream out of order, or if it executes instructions on X "elements" in the stream in round-robin fashion, then it will need multiple contexts and will need to switch between them. In a regular processor you would call that multithreading; in a GPU the only difference is that part of the context is fixed ... because they are all executing the same shader.

Still makes sense to call it multithreading IMO. NVIDIA seems to agree, researchers in this field agree (take this, for instance), and you disagree. You are outnumbered, and in language being outnumbered means being wrong. If the 3D world wants to use its own lingo it will, simple as that.
 
I agree with MfA. This is getting like the old "what's multisampling?" discussions. The definition of industry terminology is based on consensus. If enough people start calling something X, it becomes known as X.

What are the non-pathological cases in which any modern processor can outperform the per-clock performance of a vertex shader? Sure, a P4 @ 3GHz can catch up to a 4-functional-unit part at 500MHz in throughput, if all the stars are in alignment with respect to memory and cache usage and the instructions used, but in the vast majority of cases I've seen, the best P4s and Athlons get blown away by even DX8 hardware.

A stream processor doesn't necessarily have to act on one unit of input at a time. A "chunked" stream processor is also a viable alternative, in which a kernel of data is processed in an out-of-order fashion within a sliding window.
 