Less is more

Frank

Graphics hardware has come a long way in the last 10 years. From the original SGI cards that required a separate cabinet, through fixed-function pipelines, through SIMD towards MIMD. So, what direction are we heading?

First, the challenge was in fitting everything inside one chip, of course. When the original SGI cabinet fitted on one card, the focus shifted towards speed and features. More options to produce better graphics. And speed to increase the number of objects, to produce better graphics as well.

But a fixed-function pipeline is like a factory. In car-factory terms, it went from "any color, as long as it's black" to "What accessories do you want with what model?". And at that moment it hit the wall.

The pipeline model let you choose whether to use each option and which of the available models to pick, but it didn't offer the possibility to add something that wasn't in stock, or for which there was no machine to produce it.

You had your basic building blocks and the limited ways in which they could be connected. And that was it, although a lot of clever ways were designed to do additional things, like using textures to pre-compute certain effects.

The things that improved the result and were valued by the designers were added as new machines and process steps. But it was still just an assembly line to churn out as many models as possible, consisting of predefined parts.

The clever hacks that were incorporated, mostly the many clever things you could do with textures, inspired the use of shaders, which could compute the values from those textures on the fly and enabled new ways to transform the view of the model.

Yea, DUH!

Ok. As we saw with the evolution of CPUs, at some point it becomes unfeasible to just keep adding new functions, especially once you realize that you can do it all, and faster as well, by combining fewer functions that run faster. RISC.

That has happened already with GPUs, as most current ones emulate the old, fixed-function pipeline in software. And it seems almost inevitable to continue going RISC. But they have hit the wall again.

This time, the problem is SIMD. Quads. Anything can be produced, as long as you take multiples of four of them. If you only want one, throw three away. But it is not feasible to reduce the quads to single pipes, as you would only be able to produce less than half the number of pixels for the same number of transistors.
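
To make that concrete, here is a toy model (my own made-up coverage test and numbers, not any particular chip) of how much shading work gets thrown away when pixels can only be processed in 2x2 quads:

```cpp
// Toy model of quad-based shading waste (not any real GPU, just to illustrate).
// A triangle edge only partially covers many 2x2 quads, but every quad that is
// touched still costs four shader invocations.
#include <cstdio>

int main() {
    const int W = 16, H = 16;
    bool covered[H][W];
    // A simple half-plane "triangle edge": pixel is covered if x < y.
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            covered[y][x] = (x < y);

    int usefulInvocations = 0, totalInvocations = 0;
    for (int y = 0; y < H; y += 2) {
        for (int x = 0; x < W; x += 2) {
            int live = covered[y][x] + covered[y][x + 1]
                     + covered[y + 1][x] + covered[y + 1][x + 1];
            if (live > 0) {            // quad is touched: all 4 lanes run
                totalInvocations += 4;
                usefulInvocations += live;
            }
        }
    }
    printf("useful: %d / executed: %d (%.0f%% utilisation)\n",
           usefulInvocations, totalInvocations,
           100.0 * usefulInvocations / totalInvocations);
    return 0;
}
```

With a single big half-plane like this the waste is modest; it is lots of small or thin triangles that leave most quads only partially filled.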

The vertex processors are MIMD, but they gain a lot less from SIMD; they produce single instances anyway. You could use SIMD to produce a lot of identical units, but that is less desirable, because even if the objects are essentially the same, you would want them to be individual. When creating a crowd, for example, you want all of them to be unique.

You could fix that by breaking up the objects you want to duplicate and using different colors and parts, which you fit together. Essentially, you produce a fixed-function pipeline to generate objects...
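
A rough sketch of what I mean (all the types and fields here are made up, just to illustrate a few shared parts plus small per-instance variations):

```cpp
// Sketch of the "fixed-function pipeline to generate objects" idea: shared
// part meshes plus a small per-instance record that picks parts and varies
// colour and placement, so a crowd does not look cloned.
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

struct CrowdInstance {
    Vec3     position;
    float    heading;        // rotation around the up axis
    float    scale;
    uint8_t  headVariant;    // which shared head mesh to attach
    uint8_t  torsoVariant;   // which shared torso mesh to attach
    Vec3     clothingTint;   // per-instance colour variation
};

// Only these small records differ per person; the part meshes are reused.
std::vector<CrowdInstance> buildCrowd(int count) {
    std::vector<CrowdInstance> crowd;
    crowd.reserve(count);
    for (int i = 0; i < count; ++i) {
        CrowdInstance p;
        p.position     = {float(i % 10) * 2.0f, 0.0f, float(i / 10) * 2.0f};
        p.heading      = 0.4f * float(i % 7);
        p.scale        = 0.9f + 0.02f * float(i % 11);
        p.headVariant  = uint8_t(i % 4);
        p.torsoVariant = uint8_t(i % 3);
        p.clothingTint = {0.2f * float(i % 5), 0.3f, 0.25f * float(i % 3)};
        crowd.push_back(p);
    }
    return crowd;
}
```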

So, the main problem for both types of shaders is producing lots of identical units versus producing fewer, unique ones.

While the functionality of the VS and PS will be merged, the function of both units is quite distinct, although they perform the same operations if you break down those operations far enough and disregard the different storage and sorting requirements. And those last two are interesting.

We have buffers filled with vertices, triangles, textures, heightmaps, cubemaps, offsetmaps, lightmaps, z-maps, pixels, shaders, fragments and lots of other intermediate objects. All of which are added to support the assembly line model. Most of them have no use in themselves, you just need them to supplement the current model of creating graphics.

Now, there are more ways to render a scene than the most common one, like tile-based rendering and ray-tracing. And most ways to speed up the brute-force, immediate-mode approach that uses vertices and a rasterizer require a representation of the whole scene to do things more cleverly.

Instead of just processing random vertices, fitting those together for rasterization and churning out pixels as fast as possible, you can assemble them all into triangles and a full scene first. Essentially, you use surfaces instead of vertices, pixels and all the intermediate values (like Z), and produce the scene directly from them.

If you have all surfaces, you can do ray-tracing. Or you can do real anti-aliasing, by adding the colors of all the subpixels at the divisions of the triangles.
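
For the anti-aliasing part, the core of it is just averaging the colours of the sub-pixel samples for each pixel. A minimal sketch, with a made-up shade() standing in for whatever evaluates one sub-sample from the stored surfaces:

```cpp
// Minimal sketch of "real" anti-aliasing by averaging sub-pixel samples.
// shade() is a stand-in for whatever evaluates the colour of one sub-sample
// from the stored surfaces; here it just draws a hard edge along y = x so
// the averaging visibly smooths it.
struct Colour { float r, g, b; };

Colour shade(float x, float y) {
    return (y > x) ? Colour{1.0f, 1.0f, 1.0f} : Colour{0.0f, 0.0f, 0.0f};
}

Colour antiAliasedPixel(int px, int py, int n = 4) {
    Colour sum{0.0f, 0.0f, 0.0f};
    for (int sy = 0; sy < n; ++sy) {
        for (int sx = 0; sx < n; ++sx) {
            // Sample at the centre of each sub-pixel cell.
            Colour c = shade(px + (sx + 0.5f) / n, py + (sy + 0.5f) / n);
            sum.r += c.r; sum.g += c.g; sum.b += c.b;
        }
    }
    float inv = 1.0f / (n * n);
    return {sum.r * inv, sum.g * inv, sum.b * inv};
}
```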

To do all that and more, we need a clever way to store all those surfaces, so we can look up all relevant (sub-) pixels as fast as possible. And we need descriptors, for things like textures, transparency, fog and light sources (intensity).
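
Something along these lines, perhaps (a uniform grid is only a placeholder for whatever clever structure would really be used, and all the names are made up):

```cpp
// One possible way to store whole surfaces plus their descriptors, so a
// renderer (rasteriser or ray tracer) can look them up spatially.
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

struct SurfaceDescriptor {        // the non-geometric properties
    uint32_t textureId;
    float    transparency;        // 0 = opaque, 1 = fully transparent
    float    fogDensity;
    Vec3     emission;            // light-source intensity, if any
};

struct Surface {                  // a whole triangle, not loose vertices
    Vec3 v0, v1, v2;
    SurfaceDescriptor desc;
};

struct SceneGrid {                // coarse spatial index over all surfaces
    Vec3  origin;
    float cellSize;
    int   nx, ny, nz;
    std::vector<std::vector<uint32_t>> cells;  // surface indices per cell

    // Which cell's surface list covers a given point (clamped to the grid)?
    std::vector<uint32_t>& lookup(const Vec3& p) {
        auto clampi = [](int v, int hi) { return v < 0 ? 0 : (v >= hi ? hi - 1 : v); };
        int ix = clampi(int((p.x - origin.x) / cellSize), nx);
        int iy = clampi(int((p.y - origin.y) / cellSize), ny);
        int iz = clampi(int((p.z - origin.z) / cellSize), nz);
        return cells[(iz * ny + iy) * nx + ix];
    }
};
```

The point is not the grid itself, but that the whole scene, surfaces plus descriptors, sits in one queryable structure instead of being streamed through and thrown away.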

Also, if we use whole models, we need to know what is solid. Therefore, wouldn't it be easier to describe not the solids, but the space inside those solids? For CGI, any scene is indoors!

Done right, this gives us free collision detection, real anti-aliasing and easy ray-tracing.

So, instead of rendering (or discarding) all individual vertices and pixels, we would calculate a single object, consisting of a single mesh, that fills all available open space. And we only render that single object.

That way, the bottleneck represented by quads and brute force processing becomes moot. And as the CPU becomes more and more the bottleneck for pre-calculating effects like water, displacement mapping and whatever, we can do all of those things at the same moment we calculate the single mesh!

That also makes it quite easy to develop a unified shader and create a real standardized way to render graphics.

What do you think?
 
DiGuru said:
Graphics hardware has come a long way in the last 10 years. From the original SGI cards that required a separate cabinet, through fixed-function pipelines, through SIMD towards MIMD. So, what direction are we heading?
Just as an aside, did you know that many of the early graphics systems were programmable and then drifted to become fixed function?
 
Simon F said:
DiGuru said:
Graphics hardware has come a long way in the last 10 years. From the original SGI cards that required a separate cabinet, through fixed-function pipelines, through SIMD towards MIMD. So, what direction are we heading?
Just as an aside, did you know that many of the early graphics systems were programmable and then drifted to become fixed function?

I've read somewhere that the Rendition Verite 1000 was basically a RISC processor and it was painful to code for.
 
DiGuru said:
Ok. As we saw with the evolution of CPUs, at some point it becomes unfeasible to just keep adding new functions, especially once you realize that you can do it all, and faster as well, by combining fewer functions that run faster. RISC.

That has happened already with GPUs, as most current ones emulate the old, fixed-function pipeline in software. And it seems almost inevitable to continue going RISC. But they have hit the wall again.

This time, the problem is SIMD. Quads. Anything can be produced, as long as you take multiples of four of them. If you only want one, throw three away.
I think you have misunderstood a common principle of HW. You seem to be implying that small is good (i.e. citing RISC) and that it's very bad to throw away some of your results, but that happens all the time in CPUs, even RISC ones. Speculative branching and predicated execution are two such examples which make things run faster, yet they often do completely redundant work.
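
Predication, for example, boils down to computing both candidate results and selecting one, so some of the work is discarded by design; something like this generic sketch (not any particular ISA):

```cpp
// Sketch of predicated execution: both candidate values are computed and the
// condition selects one, so half of the work is thrown away on purpose, yet
// there is no branch to mispredict or to stall the pipeline on.
float predicatedAbs(float x) {
    float keep   = x;          // result if x >= 0
    float negate = -x;         // result if x <  0  (computed regardless)
    return (x >= 0.0f) ? keep : negate;   // typically a conditional select, not a branch
}
```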
 
There's a conception here that RISC was proven to be the better architecture. This was thought true back in the early 90's, but nowadays it's much less clear cut. AFAICS this is because the cost of decode logic has dropped relative to the cost of 'fixed cost' items such as ALUs and caches, and therefore the importance of extracting the most out of them has risen. If the CISC architecture could get 10% more ALU utilisation and achieve 30% effective 'compression' in the instruction cache, it may be cheaper.

As regards the packing of data into quads... well, I have an advantage in that I can ask our chips exactly what the quad population is ;). We aren't throwing all that much away.
 
Simon F said:
DiGuru said:
Ok. As we saw with the evolution of CPUs, at some point it becomes unfeasible to just keep adding new functions, especially once you realize that you can do it all, and faster as well, by combining fewer functions that run faster. RISC.

That has happened already with GPUs, as most current ones emulate the old, fixed-function pipeline in software. And it seems almost inevitable to continue going RISC. But they have hit the wall again.

This time, the problem is SIMD. Quads. Anything can be produced, as long as you take multiples of four of them. If you only want one, throw three away.
I think you have misunderstood a common principle of HW. You seem to be implying that small is good (i.e. citing RISC) and that it's very bad to throw away some of your results, but that happens all the time in CPUs, even RISC ones. Speculative branching and predicated execution are two such examples which make things run faster, yet they often do completely redundant work.

No, that's just because they use deep pipelines. They are just better than the alternatives of stalling the pipeline and/or calculating both. They are not efficient, they are just less bad.
 
Dio said:
There's a conception here that RISC was proven to be the better architecture. This was thought true back in the early 90's, but nowadays it's much less clear cut. AFAICS this is because the cost of decode logic has dropped relative to the cost of 'fixed cost' items such as ALUs and caches, and therefore the importance of extracting the most out of them has risen. If the CISC architecture could get 10% more ALU utilisation and achieve 30% effective 'compression' in the instruction cache, it may be cheaper.

As regards the packing of data into quads... well, I have an advantage in that I can ask our chips exactly what the quad population is ;). We aren't throwing all that much away.

Yes, but you don't use dynamic branching or other instructions that can result in different render paths for each pixel, do you? And that's what you need to do to go to a more general, flexible model. Without flow control, it is not really a general architecture, and doing things like ray-tracing and dynamic branching (although they can be emulated) is less efficient than using the brute-force approach of just trying to render or discard all vertices and pixels as they come along...

And don't misinterpret the current CISC processors like the x86: they're just superscalar RISC cores with some microcode around them to emulate CISC. Just like you do to emulate the fixed functions. There is really no other direction for the core to go.
 
DiGuru said:
No, that's just because they use deep pipelines. They are just better than the alternatives of stalling the pipeline and/or calculating both. They are not efficient, they are just less bad.

And you don't think graphics chips have deep pipelines? :oops:

If a CPU is like transporting the oil from the North Sea, then a graphics chip would have its pipes in the Marianas trench!
 
Simon F said:
DiGuru said:
No, that's just because they use deep pipelines. They are just better than the alternatives of stalling the pipeline and/or calculating both. They are not efficient, they are just less bad.

And you don't think graphics chips have deep pipelines? :oops:

If a CPU is like transporting the oil from the North Sea, then a graphics chip would have its pipes in the Marianas trench!

Yes, they have deep pipelines. So, when they want to do branching, they have those problems as well. And when those pipelines are SIMD, those problems become even bigger.

Of course, they could change the programming model such that quads, not pixels, become the smallest units, or forego flow control completely. That would work. But would it make a better GPU?
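
To illustrate the SIMD problem: when the pixels in one SIMD group disagree on a branch, the hardware effectively has to run both paths and mask out the lanes, something like this toy model (four lanes, made-up shading paths):

```cpp
// Toy model of branch divergence in a 4-wide SIMD group: when lanes disagree
// on the condition, both paths are executed and the results are merged per
// lane, so the whole group pays for path A *and* path B.
#include <array>

using Lanes = std::array<float, 4>;

Lanes divergentShade(const Lanes& input) {
    std::array<bool, 4> takeA;
    for (int i = 0; i < 4; ++i) takeA[i] = input[i] > 0.5f;

    Lanes resultA, resultB, out;
    for (int i = 0; i < 4; ++i) resultA[i] = input[i] * 2.0f;   // path A, run for all lanes
    for (int i = 0; i < 4; ++i) resultB[i] = 1.0f - input[i];   // path B, run for all lanes

    // Merge: each lane keeps only the path it actually wanted.
    for (int i = 0; i < 4; ++i) out[i] = takeA[i] ? resultA[i] : resultB[i];
    return out;
}
```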
 
I think you should ask yourself, "what would the image look like if all pixel pipelines were doing completely different things?"
 
Simon F said:
I think you should ask yourself, "what would the image look like if all pixel pipelines were doing completely different things?"

That depends entirely on the method used to render the image. For example, you would get a different answer from the people that use ray-tracing instead of the brute force method.

And even then, it's a chicken-and-egg problem: as long as there is no good way to use conditionals and branching, there won't be many people who use them and there won't be many interesting things you can do with them.
 
Dio said:
...
AFAICS this is because the cost of decode logic has dropped relative to the cost of 'fixed cost' items such as ALUs and caches, and therefore the importance of extracting the most out of them has risen. If the CISC architecture could get 10% more ALU utilisation and achieve 30% effective 'compression' in the instruction cache, it may be cheaper.
...

The (not very successful) VLIW initiative from Intel, IBM, HP and some of the other Big Boys was meant to do away with decode logic as much as possible. Essentially, each bit in the instruction word would directly trigger an action. It would save lots of transistors and enable them to remove the CISC layer around their RISC cores.

The main reason it failed is that programmers and software are too entrenched; it is no longer feasible to come up with a completely different instruction set that is incompatible with the old ones.

Apple did something similar and succeeded, but it is doubtful such a new architecture would succeed otherwise. So, it is very important to keep as much of the API intact as possible. For GPUs, that would work best with an approach like GLSL.

Anyway, I think a lot of transistors in current GPUs are used for support mechanisms like Z-buffers, normal maps and other things to help the current model. With a new model, those could be used for other purposes.
 
DiGuru said:
Simon F said:
I think you should ask yourself, "what would the image look like if all pixel pipelines were doing completely different things?"

That depends entirely on the method used to render the image. For example, you would get a different answer from the people that use ray-tracing instead of the brute force method.
You didn't think about it enough.
 
Simon F said:
DiGuru said:
Simon F said:
I think you should ask yourself, "what would the image look like if all pixel pipelines were doing completely different things?"

That depends entirely on the method used to render the image. For example, you would get a different answer from the people that use ray-tracing instead of the brute force method.
You didn't think about it enough.

What do you mean?

Like, flow control would not be a good thing? Even for things like ray-tracing or procedural textures that are calculated per pixel when needed?

Or do you just mean that the image would look static, due to poor performance? Otherwise, I don't get it.
 
DiGuru said:
Anyway, I think a lot of transistors in current GPUs are used for support mechanisms like Z-buffers, normal maps and other things to help the current model. With a new model, those could be used for other purposes.
This is the 'More generalisation for higher utilisation' argument. Surely, all you need are a bunch of FIFOs, a bunch of caches, and a bunch of general processors? It's been tried before, and so far it hasn't worked.

The main reason is that so far VPUs have been able to gain efficiency over the general model, largely by knowing that a particular operation only needs a particular precision (thereby saving ALU die area) or that a particular FIFO or cache only needs a particular number of bits (thereby saving buffer area).

The same logic applies to why pixels have become glued together in quads (to save control logic area and to improve coherence of memory accesses), although that is also to solve a particular issue (the derivative of dereferenced textures for mipmapping). You are right that the dynamic branching issue may force revisiting of this.
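
Roughly speaking, the quad hands you those derivatives by differencing neighbouring pixels, something like this (very simplified, isotropic filtering only; real hardware is more subtle):

```cpp
// Why a 2x2 quad helps mipmapping: the derivatives of the texture coordinates
// fall out of simple differences between neighbouring pixels in the quad.
#include <algorithm>
#include <cmath>

struct UV { float u, v; };

// uv[0..3] = texture coordinates of the quad's pixels:
//   0: top-left, 1: top-right, 2: bottom-left, 3: bottom-right
float mipLevel(const UV uv[4], float texSize) {
    float dudx = uv[1].u - uv[0].u, dvdx = uv[1].v - uv[0].v;  // d/dx across the quad
    float dudy = uv[2].u - uv[0].u, dvdy = uv[2].v - uv[0].v;  // d/dy across the quad
    float fx = std::sqrt(dudx * dudx + dvdx * dvdx) * texSize;
    float fy = std::sqrt(dudy * dudy + dvdy * dvdy) * texSize;
    float footprint = std::max(fx, fy);          // texels covered per pixel
    return std::max(0.0f, std::log2(footprint)); // mip level to sample
}
```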

It's open as to what will be the best further down the line. Rest assured there are far smarter people than me who will be making that decision.
 
Simon F said:
I think you should ask yourself, "what would the image look like if all pixel pipelines were doing completely different things?"

Well, there would probably be some serious cache thrashing going on, and some registers in the pixel shader stage would be hammered too if shared. So it would be slow, but beyond low FPS I can't see any problems for the image quality?
 
DiGuru said:
What do you mean?

Like, flow control would not be a good thing? Even for things like ray-tracing or procedural textures that are calculated per pixel when needed?

Or do you just mean that the image would look static, due to poor performance? Otherwise, I don't get it.
If all the pixels are doing completely independent things, the image would, very likely, look like a noisy mess.

You are going to get a reasonable amount of coherence from pixel to pixel and so, even for a ray tracer, you will have nearby rays behaving in similar ways and thus, to a reasonable extent, following the same code path. If they didn't, just consider how poor the performance would be on a CPU because the efficiency of its caches and branch prediction units would, otherwise, plummet.
 
Dio said:
DiGuru said:
Anyway, I think a lot of transistors in current GPUs are used for support mechanisms like Z-buffers, normal maps and other things to help the current model. With a new model, those could be used for other purposes.
This is the 'More generalisation for higher utilisation' argument. Surely, all you need are a bunch of FIFOs, a bunch of caches, and a bunch of general processors? It's been tried before, and so far it hasn't worked.

The main reason is that so far VPUs have been able to gain efficiency over the general model, largely by knowing that a particular operation only needs a particular precision (thereby saving ALU die area) or that a particular FIFO or cache only needs a particular number of bits (thereby saving buffer area).

I'm not suggesting no model (general purpose only), I'm suggesting a different model, one that has a better chance of being used in a clever way and that offers interesting solutions to things that are hard to do with the current hardware.

The same logic applies to why pixels have become glued together in quads (to save control logic area and to improve coherence of memory accesses), although that is also to solve a particular issue (the derivative of dereferenced textures for mipmapping). You are right that the dynamic branching issue may force revisiting of this.

It's open as to what will be the best further down the line. Rest assured there are far smarter people than me who will be making that decision.

:D Don't tell me you don't have your own idea about it.

;)
 
Simon F said:
DiGuru said:
What do you mean?

Like, flow control would not be a good thing? Even for things like ray-tracing or procedural textures that are calculated per pixel when needed?

Or do you just mean that the image would look static, due to poor performance? Otherwise, I don't get it.
If all the pixels are doing completely independent things, the image would, very likely, look like a noisy mess.

You are going to get a reasonable amount of coherence from pixel to pixel and so, even for a ray tracer, you will have nearby rays behaving in similar ways and thus, to a reasonable extent, following the same code path. If they didn't, just consider how poor the performance would be on a CPU because the efficiency of its caches and branch prediction units would, otherwise, plummet.

Well, it depends on the model used. For example, when using ray-tracing on curved surfaces, the path followed can be quite different from pixel to pixel. With current hardware, you could emulate that by just following all possible paths. Or using a quad as the smallest unit. I don't think that would look nice.

Another thing that is very interesting right now is calculating the texture per pixel on demand. So, no pre-calculated textures, but just a formula that calculates how the pixel looks when a certain (procedural) texture is applied. Some things are rather hard to calculate purely mathematically and would require a lookup table (i.e. a texture) or flow control to do right. Or you could (but you wouldn't want to) just calculate all iterations to be sure the result would be correct.
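
As a toy example of the kind of per-pixel procedural texture I mean, something escape-time-like where the loop count genuinely varies from pixel to pixel (nothing standard, just an illustration):

```cpp
// Toy per-pixel procedural texture: a Mandelbrot-style iteration whose loop
// count differs from pixel to pixel. Without real flow control you would
// either run the worst case everywhere or fall back to a pre-baked lookup
// texture.
float proceduralShade(float cx, float cy) {
    float zx = 0.0f, zy = 0.0f;
    int   i = 0;
    const int maxIter = 64;
    while (i < maxIter && zx * zx + zy * zy < 4.0f) {  // per-pixel loop count
        float t = zx * zx - zy * zy + cx;
        zy = 2.0f * zx * zy + cy;
        zx = t;
        ++i;
    }
    return float(i) / maxIter;   // shade by escape time
}
```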

Anyway, the fact that the penalty of using flow control is quite large when not all pixels take the same execution path is why it is hard to continue with the current model.
 
Well, what do you think of the model? Nobody has answered that yet.

Btw, does anyone know if something like this could be done with SM2.x/SM3.0?
 