Fundamental 3D Microarchitecture

Vince

There's always so much talk of the somewhat obscure "3D Pipeline" and its advantages over general computing. Companies go to great lengths to put out cute and simplistic block diagrams of their 3D chips, such as this one by ATI:

[image: ATI block diagram (fppu.gif)]


At its foundation, what is the underlying logic that is presently driving Vertex Shaders, Fragment Shaders, and the other ops done per clock?

Traditionally, it was my understanding that at heart the 3D pipeline was nothing but many logic blocks placed in parallel, each dedicated to a specific feature. This was the [main] advantage of dedicated 3D ASICs: they benefit from linear performance scaling with each additional logic block that's added. Bilinear filtering comes to mind as an op that used many parallel logic blocks to manipulate and produce a sample per clock.

Is this still relevant? For example, aren't VSs basically just an array of FP/SIMD processors running in parallel so that they can execute XX vertices/second? What about the T-Setup?

I think you can see where I'm going with this; anything related to the very lowest level of hardware implementations would be greatly appreciated, as it's never talked about.
 
Well, flow control generally decreases the efficiency of these processors and makes them potentially not-so-linear, but as long as the state doesn't change often (i.e. you don't frequently change shaders or other rendering variables, such as whether or not to do an alpha test), you have little problem maintaining high efficiency.

Additionally, the NV30 looks to support flow control in the pixel shader pipelines in a form that executes all branches, with the final value chosen at the end. This sort of format preserves the incredible predictability of dedicated 3D hardware, while adding significant flexibility.
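
To make that concrete, here's a minimal sketch (illustrative C++, not NV30's actual datapath) of the "execute both sides, pick the result at the end" style, which keeps the shader's cost fixed regardless of the data:

Code:
// Both paths are always evaluated; the condition only selects which result
// is kept, so every fragment takes exactly the same number of operations.
float shade_predicated(float n_dot_l, float lit_color, float shadow_color)
{
    float lit_result    = lit_color * n_dot_l;     // "then" path, always run
    float shadow_result = shadow_color * 0.25f;    // "else" path, always run
    return (n_dot_l > 0.0f) ? lit_result : shadow_result;  // final select
}

A true branch would skip one of the two multiplies, but at the price of making the shader's execution length data-dependent, which is exactly what the fixed scheduling described below relies on avoiding.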

Anyway, this is all based on one thing:

The architecture almost always knows exactly how many clock cycles each operation will take. This allows the hardware to relatively easily schedule operations for each pipeline. As long as you're not changing the state all the time, this remains true.

One example of a game that does change the state all the time is Unreal Tournament. Have you noticed how UT doesn't scale well with either CPUs or video cards? This is simply because, at least on the video card end, there's so much overhead in just managing the massive number of state changes that the hardware simply cannot keep all of the pipelines full.
 
Actually, dynamic flow control has little impact on performance as long as the shaders are not dependent on each other, e.g. a vertex shader cannot access other vertices. Of course, flow control can make a vertex shader yield a different execution length for different vertices, and the pipeline may have to handle the resulting pipeline bubbles.

Of course, constant-based flow control can be faster, since the GPU (or driver) can eliminate the branch instructions based on the current state to accelerate things even further.
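
As a rough illustration of the constant-based case (hypothetical C++, not any particular driver): the branch depends only on a constant that is known when the state is set, so it can be resolved once per state change instead of once per vertex.

Code:
// Two pre-resolved shader variants; the branch has been folded away.
float transform_lit(float v)   { return v * 2.0f; }
float transform_unlit(float v) { return v; }

typedef float (*VertexFn)(float);

// Called when the constant/state changes, not per vertex.
VertexFn resolve_variant(bool lighting_constant)
{
    return lighting_constant ? transform_lit : transform_unlit;
}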
 
pcchen said:
Actually, dynamic flow control has little impact on performance as long as the shaders are not dependent on each other, e.g. a vertex shader cannot access other vertices. Of course, flow control can make a vertex shader yield a different execution length for different vertices, and the pipeline may have to handle the resulting pipeline bubbles.

Of course, constant-based flow control can be faster, since the GPU (or driver) can eliminate the branch instructions based on the current state to accelerate things even further.

I think this is the problem with dynamic flow control in a pixel shader: the penalties there are much larger if the execution length is not what was expected. This may be a good reason why nVidia has implemented a method of executing each branch and then choosing one of them (which is, I believe, very similar to Intel's Itanium architecture: just go ahead and execute all possibilities, and drop the incorrect branch once it's deemed incorrect).

After all, if you're going to keep your pipelines full all the time, or at least almost all the time, you need to know everything that's going to be coming down those pipelines potentially hundreds of cycles before it arrives.
 
But, Chalnoth, wouldn't there be a performance penalty for executing all branches? I mean, there is only a limited number of shader ALUs per pipeline.
 
When you're talking about a penalty of possibly hundreds of instructions lost from one missed branch prediction, executing a couple of instructions in parallel might not be that excessive.

But yes, there are certainly optimizations to be had. The essence of the issue is that the two approaches, actually branching and executing all branches, each have their own performance hits. You simply can't have a branch without a performance hit. For example, the primary reason that video chips can produce so much more than CPUs is that they generally don't have to worry about branching.
 
Chalnoth said:
For example, the primary reason that video chips can produce so much more than CPUs is that they generally don't have to worry about branching.

Sure? Does it have nothing to do with the fact that current CPUs have 2 to 4 functional units (ALUs) for integer or FP ops while GPUs have 2 or more full FP SIMD units plus dozens of multipliers, adders and dividers? Or because CPUs can only fetch up to 3 instructions each cycle from a single sequential stream while GPUs can execute up to 4 vertex shaders and up to 8 pixels completely in parallel? Does it have nothing to do with the fact that there is specific (and expensive in terms of transistors) hardware for tasks (such as rasterization) that are really hard for general-purpose CPUs?

Sure, branches in CPUs hurt their relative performance vs. hardware renderers even more, but not as much as the other issues. Let's face it: the current top PC CPU has around 40M transistors, while the current top GPU has over 100M transistors. GPUs run at slower clock rates, but they also do FAR MORE than a CPU per cycle.
 
Umm... the way I understand it, the main reason that 3D hardware can reach so much higher processing speeds than general CPUs is that every pixel on the screen can be computed independently of every other pixel, and ditto for vertices, thus opening up opportunities for massive parallelism.

AFAIK, modern pixel shaders (except possibly Xabre) are able to hold enough threads (where processing of 1 pixel = 1 "thread") in flight to actually absorb a texture cache miss, and seeing as a branch mispredict is less expensive than a full data-cache miss, I really don't think the cost of a mispredicted branch in a pixel shader is bad enough to hurt performance noticeably. Branches in pixel shaders have problems (like partial derivative computation), but raw performance is not one of them.
 
RoOoBo said:
Sure? Does it have nothing to do with the fact that current CPUs have 2 to 4 functional units (ALUs) for integer or FP ops while GPUs have 2 or more full FP SIMD units plus dozens of multipliers, adders and dividers? Or because CPUs can only fetch up to 3 instructions each cycle from a single sequential stream while GPUs can execute up to 4 vertex shaders and up to 8 pixels completely in parallel? Does it have nothing to do with the fact that there is specific (and expensive in terms of transistors) hardware for tasks (such as rasterization) that are really hard for general-purpose CPUs?

Well, obviously that has something to do with it, but don't forget that there are general processors out there that don't use the x86 instruction set, and not one of them approaches the computational efficiency of a modern graphics chip.

So, in other words, the parallelism of the data does indeed help tremendously in keeping all of the pipelines full. If not for the ability of each and every vertex and pixel to be independent of every other, then there would be no realistic way to fill all of the pipelines all the time. Obviously this helps a ton.

As for branching, however, what you must realize is that modern graphics chips have hundreds of stages in their pipelines. Missing a branch would be a huge penalty. Not only that, but modern graphics hardware obviously doesn't have any branch prediction units, and the extra transistor cost of adding them would likely not sit well with 3D chip companies.

And that, I think, is the number one reason we won't see true branching: the transistor cost. After all, true branching would generally just be a performance optimization. But what good is branching when you can instead just implement more processing power?

Still, in the future, it does seem conceivable that we will have compiler-controlled branching: branching that is light on hardware requirements (no large transistor counts) and tailor-made for specific hardware, so the branches don't interfere with normal operation. I don't think this could happen until we completely move away from assembly shaders.
 
Branching would be relatively expensive (if implemented with all the bells and whistles that a modern CPU has), but not hugely so... otherwise, it wouldn't be implemented in vertex processors.

A much bigger reason it isn't being implemented is that it's rarely needed (in addition to the implicit fencing necessary to get derivative operations working). A huge subset of useful, attractive and interesting shaders (which is a very small subset of all possible shaders) can be efficiently implemented using instruction predication and some clever instruction re-ordering (which a good HLSL optimizing compiler should be able to do automatically).

Ray marching techniques are difficult to do with predication (the ray must be marched a constant number of times, which means you have to balance the amount of extra work done on 80% of the fragments against the possibility of introducing shader aliasing in the other 20%), but these are very unlikely to run in real time anytime soon, even if fragment processors did have early exit and branching capability.
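
For what it's worth, here is a small sketch (illustrative C++; the step count, threshold and density function are made up) of what a predicated ray march looks like: every fragment pays for all the iterations, and a flag simply freezes the result once a hit is found.

Code:
const int MAX_STEPS = 32;               // constant trip count for every fragment

float march(float t_start, float t_step, float (*density)(float))
{
    float t = t_start;
    float hit_t = -1.0f;                // -1 means "no hit found yet"
    for (int i = 0; i < MAX_STEPS; ++i)
    {
        bool first_hit = (density(t) > 0.5f) && (hit_t < 0.0f);
        hit_t = first_hit ? t : hit_t;  // predicated update instead of an early exit
        t += t_step;                    // keep marching regardless
    }
    return hit_t;
}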
 
what you must realize is that modern graphics chips have hundreds of stages in their pipelines

Yes and no. There are lots of stages; however, they are largely independent of each other, and once a fragment reaches the shader stage, a branch isn't going to send it back to primitive assembly.

The way a branch would probably be implemented in graphics hardware would be to create a dedicated branch evaluation unit, and just add a large FIFO in front of it. The branch evaluator would evaluate the branch condition for all fragments in its FIFO, and after determining which branch to take, place the fragment back in the shader's FIFO to await further processing. The bigger penalty would be that individual fragments in a processing unit (most GPUs operate on 2x2 "quads" of fragments at a time) could each be running a different instruction, which means the hardware would need to fetch 4 individual shader instructions at once rather than using the same instruction for each fragment. That equates to significantly more memory bandwidth being spent on instruction fetching. But that's about it.
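
A toy model of that last point (illustrative C++; the quad layout is an assumption): while all four fragments in a quad share a program counter, one fetch serves the whole quad, but once they diverge the hardware needs up to four separate instruction fetches.

Code:
struct Quad
{
    int pc[4];   // per-fragment program counters within a 2x2 quad
};

// Count how many distinct instructions must be fetched for this quad.
int instruction_fetches_needed(const Quad& q)
{
    int fetches = 0;
    for (int i = 0; i < 4; ++i)
    {
        bool already_counted = false;
        for (int j = 0; j < i; ++j)
            if (q.pc[j] == q.pc[i])
                already_counted = true;
        if (!already_counted)
            ++fetches;
    }
    return fetches;   // 1 when the quad is coherent, up to 4 when fully divergent
}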
 
I have some doubts about how the partial derivative instructions (DDX, DDY) in NV30 will work.

From what I know, they solve this linear system

Code:
u0 = a*x0 + b
u1 = a*x1 + b

(where u0 and u1 are the values of the parameter to be interpolated at the two pixels, x0 and x1 are the two pixels' X coordinates, and a and b are the interpolation coefficients that have to be calculated)

for each of the components of the input vector between two adjacent pixels (in the X or Y axis). I think an exact computation would require multiple divisions or multiple matrix inversions.

This means that there must be communication between the different pixel pipes to calculate these values (with the pipe to the right or left for DDX, or with the pipe above or below for DDY) if they are working in 2^n x 2^n blocks. It also means that all the pipes must run synchronized on a block of pixels from the same triangle.
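
(Note that for adjacent pixels x1 = x0 + 1, so solving the system degenerates into a = u1 - u0 and b = u0 - a*x0, i.e. a simple difference between neighbouring fragments.) A minimal sketch of that cross-pipe difference (illustrative C++; the 2x2 layout and indexing are assumptions):

Code:
// Per-component difference between neighbouring fragments in a 2x2 block.
struct QuadValues
{
    float v[2][2];   // v[y][x]: one component of the parameter, per fragment
};

float ddx(const QuadValues& q, int y) { return q.v[y][1] - q.v[y][0]; }
float ddy(const QuadValues& q, int x) { return q.v[1][x] - q.v[0][x]; }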

But as the most common use of those derivatives seems to be to interpolate texture coordinates or other input parameters (color, depth) at subpixel coordinates within the fragment for antialiasing, I wonder if there is a cheaper implementation.

If it were assumed that the input vectors are the original fragment input texture coordinates or parameters (or perhaps scaled versions of them), DDX/DDY could reuse the derivatives already calculated in the triangle setup stage, which would be a kind of hidden parameter for the pixel pipe that the DDX/DDY instructions would be able to access. This would avoid any kind of synchronization or even communication between the pixel pipes (the first is easy if they already work on blocks of the same triangle without branching, but I see the second as somewhat hard because of possible wire delays). However, the problem is that if the vector passed as input to DDX or DDY is not related to the texture coordinates or the other parameters calculated at triangle setup, the result of the instruction would be completely wrong.
 
Actually, the most common use for the partial derivative operations is to calculate mipmap LOD (which has been the case since OpenGL 1.0 was released). It's actually quite easy to implement, and has been free in hardware for quite some time.

In general, derivatives of all the input texture coordinates are not calculated -- you just interpolate 1/w and the Barycentric weights for the entire triangle. Then, the interpolated value for any texture coordinate can be computed by solving a simple linear system.
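
A minimal sketch of that per-fragment evaluation (illustrative C++; the names and the exact split of work between setup and the pixel pipe are assumptions), given screen-space barycentric weights and the three vertices' 1/w values:

Code:
// Perspective-correct interpolation of one attribute from barycentric
// weights; the denominator is just the interpolated 1/w for the fragment.
float interpolate_attribute(float a, float b, float c,          // barycentric weights
                            float inv_w0, float inv_w1, float inv_w2,
                            float f0, float f1, float f2)       // per-vertex attribute
{
    float num   = a * f0 * inv_w0 + b * f1 * inv_w1 + c * f2 * inv_w2;
    float inv_w = a * inv_w0 + b * inv_w1 + c * inv_w2;          // interpolated 1/w
    return num / inv_w;
}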
 
You mean the OpenGL equation for parameter (datum) interpolation in a triangle?

f = a * f_a + b * f_b + c * f_c; (this is the version in NV_fragment_shader)

But I think that doing 3 muls and 2 adds for each parameter (and each set of texture coordinates is 4 parameters) is more expensive than interpolating each parameter incrementally (a single addition per parameter). And the derivatives (or, more exactly, the slopes) are calculated just once per triangle, at triangle setup.
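
A sketch of that incremental scheme for comparison (illustrative C++; this is the classic DDA-style approach, not any specific chip): triangle setup produces one slope per parameter, and stepping to the next pixel is a single add per parameter.

Code:
struct Interpolant
{
    float value;   // current value at this pixel
    float d_dx;    // slope from triangle setup: change per one-pixel step in X
};

// Advance every interpolated parameter one pixel to the right.
void step_right(Interpolant params[], int count)
{
    for (int i = 0; i < count; ++i)
        params[i].value += params[i].d_dx;   // one add per parameter
}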
 
No, because that increases the amount of data that needs to be stored per fragment.

In order to keep performance high (and hide texture latency), a GPU is processing hundreds of fragments at a time. In order for your system to work, you'd need to keep all of the interpolated values around for each fragment (11 x 128 bits = 1408 bits of storage per fragment (!)). Since this is storage that must be kept for all fragments in the pipeline (regardless of whether the interpolator is used in a given shader or not), the cost is enormous.

Comparatively, since only 1 interpolated value can be accessed in a given shader instruction, it is much cheaper to store the 3 barycentric weights + 1/w for the fragment (128 bits total) plus a tag indicating which primitive it came from, and put a few extra MACs in the silicon to do the barycentric interpolation for free.
 
RoOoBo said:
You mean the OpenGL equation for parameter (datum) interpolation in a triangle?

f = a * f_a + b * f_b + c * f_c; (this is the version in NV_fragment_shader)

But I think that doing 3 muls and 2 adds for each parameter (and each set of texture coordinates is 4 parameters) is more expensive than interpolating each parameter incrementally (a single addition per parameter). And the derivatives (or, more exactly, the slopes) are calculated just once per triangle, at triangle setup.

from the sst1-through-avenger coding specs it's obvious that in those classic architectures (circa '96) a TMU would incrementally calculate all of a fragment's data, with all the necessary slopes calculated at triangle setup. of course that does not mean things haven't changed over the past 6 years (or outside of 3dfx).

[ed: for clarity]
 
Well, the early 3Dfx accelerators didn't need more than 64 bits of storage per fragment (2*16bits for diffuse and specular color, 2*16 bits for a single set of 2D texture coordinates), which is less than how much it costs to carry around Barycentric weights and 1/w.

It's really just a storage problem. Storing 11 128-bit values for each of 100+ fragments is a huge amount of transistors (1408 x 150 = ~211k bits * 6 transistors/bit is ~1.3M transistors) that will frequently go unused (no current games are using 8 texture coordinates).
 
gking said:
Well, the early 3Dfx accelerators didn't need more than 64 bits of storage per fragment (2*16bits for diffuse and specular color, 2*16 bits for a single set of 2D texture coordinates), which is less than how much it costs to carry around Barycentric weights and 1/w.

It's really just a storage problem. Storing 11 128-bit values for each of 100+ fragments is a huge amount of transistors (1408 x 150 = ~211k bits * 6 transistors/bit is ~1.3M transistors) that will frequently go unused (no current games are using 8 texture coordinates).

Sorry, I'm a bit lost. Where are all those fragments? Stored in a buffer? In different stages of the pixel pipe? From what I understand (in PS2.0 hardware), there are only as many active fragments in the pixel shader stage as there are pipes (they remain until the shader program is finished). And the other stages (fog, stencil, alpha, blending, ...) no longer need the texture coordinate information.
 
gking said:
Well, the early 3Dfx accelerators didn't need more than 64 bits of storage per fragment (2*16bits for diffuse and specular color, 2*16 bits for a single set of 2D texture coordinates), which is less than how much it costs to carry around Barycentric weights and 1/w.

It's really just a storage problem. Storing 11 128-bit values for each of 100+ fragments is a huge amount of transistors (1408 x 150 = ~211k bits * 6 transistors/bit is ~1.3M transistors) that will frequently go unused (no current games are using 8 texture coordinates).

ok, i can see the general storage concern, yet i believe the voodoo line used considerably more than 64 bits per fragment for interpolators. behold, this is from their interpolators' setup:

Code:
Change in Red with respect to X (12.12 format)
Change in Green with respect to X (12.12 format)
Change in Blue with respect to X (12.12 format)
Change in Alpha with respect to X (12.12 format)
Change in Z with respect to X (20.12 format)
Change in S/W with respect to X (14.18 format)
Change in T/W with respect to X (14.18 format)
Change in 1/W with respect to X (2.30 format)

although it's not clear what the internally-maintained bitness for the above was, it still seems rather a lot. now, what space would a typical-precision barycentric vector take?
 
RoOoBo,

Those extra fragments are all in the shader pipeline's FIFOs (first in, first out), in order to absorb things like memory latency. It takes a while for a read request into DRAM to get a response (page swap, precharge, fetch, etc.). If the graphics chip did nothing while waiting for DRAM, performance would be abysmal (thousands of pixels per second, instead of billions). So, the fragments go into a FIFO to wait for the texture read to complete, and the graphics chip processes other fragments.
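
As a back-of-the-envelope illustration (made-up numbers, C++): if a memory access takes on the order of a couple hundred cycles and the shader can start roughly one fragment per cycle, you need about that many fragments sitting in the FIFO to keep the units busy.

Code:
// Rough estimate of how many fragments must be in flight so the shader
// never idles while waiting on memory.
int fragments_in_flight(int memory_latency_cycles, int fragments_started_per_cycle)
{
    return memory_latency_cycles * fragments_started_per_cycle;
}
// e.g. fragments_in_flight(200, 1) == 200 fragments queued in the FIFO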

darkblu,

Those values should be storable once per triangle, rather than for every fragment. Similarly, the per-triangle cost of using barycentric interpolators is quite large (you need to store A, B, and C (4-vectors) for each interpolated value, plus derivative information). However, this is absorbed by a low cost per fragment (f_a, f_b, f_c, 1/w, z, and some tag bits to indicate which A, B, and C to use) and by the fact that vertices can be shared between triangles. It's more than 128 bits (you probably want 192), but it's much cheaper than storing all the post-interpolated values per fragment.
 