The End of The GPU Roadmap

And remember, kids: the return of software-only 3D solutions is always just 5 years away.
Guys, he really deserves a bit more credit than this. Programmable pixel, vertex and geometry processing are all part of a return to software rendering. It removed many of the limitations that developers of cutting-edge game engines were bumping into. So a big part of the prediction did come true, and you have to be quite a visionary for that.

And it's not like we've reached the endpoint yet. Direct3D 10 and 11 offer nothing new compared to Direct3D 9. You can do exactly the same things, only more efficiently. To truly allow doing things the classic rasterization pipeline isn't capable of, we need to evolve even closer toward software rendering. Larrabee is just one year away...

Back when GPUs added ever more fixed-function abilities (e.g. texture coordinate generation and bump mapping), hardware designers quickly realized that it was pointless to spend ever more silicon on individual features. Many features were left unused most of the time. So the solution was to use generic arithmetic operations to perform the calculations in 'software'. Nowadays we're at the point where Direct3D keeps adding ever more 3D pipeline stages, many of them frequently left unused, yet they still carry an overhead in the hardware and the drivers. The solution is again to add more programmability. It allows the current API pipeline to be implemented more efficiently, but it also unleashes many new abilities by giving more control to developers.

The first pixel shader model was hardly anything more than a language describing which texture stages to enable, and what operations they should perform. We're close to that point with the API's pipeline. Typically you'll still call certain API functions to enable or disable stages and tell them how to operate, but much of that can already be described in a kind of scripting language that describes what each stage does. Here's a glimpse at the first generation of truly programmable graphics pipelines: GRAMPS.
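
To make the "describe what each stage does" idea concrete, here's a toy C++ sketch (nothing to do with GRAMPS itself; every name in it is made up): the pipeline is just data, an ordered list of programmable stages, rather than a set of fixed-function switches you enable.

#include <functional>
#include <iostream>
#include <vector>

// A "stage" is just a programmable function from one packet of data to the next.
using Packet = std::vector<float>;
using Stage  = std::function<Packet(const Packet&)>;

// The pipeline is data: an ordered list of stages, not fixed-function switches.
Packet run_pipeline(const std::vector<Stage>& stages, Packet in) {
    for (const auto& stage : stages)
        in = stage(in);
    return in;
}

int main() {
    std::vector<Stage> pipeline = {
        // "vertex" stage: scale every value
        [](const Packet& p) { Packet q(p); for (float& v : q) v *= 2.0f; return q; },
        // "pixel" stage: clamp to [0, 1]
        [](const Packet& p) { Packet q(p); for (float& v : q) v = v > 1.0f ? 1.0f : v; return q; },
    };
    for (float v : run_pipeline(pipeline, {0.25f, 0.75f}))
        std::cout << v << ' ';            // prints: 0.5 1
    std::cout << '\n';
}

Swap, remove or add a stage and the "API" doesn't change, which is exactly the flexibility a fixed stage list can't offer.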

Personally I don't think the GPU will disappear, and neither will APIs. But they'll be redefined in a way that we no longer recognise today's GPUs and APIs as the same thing. They already underwent a serious transformation with the move from fixed-function to shaders. And in this sense Sweeney was right all along.
 
Guys, he really deserves a bit more credit than this. Programmable pixel, vertex and geometry processing are all part of a return to software rendering.
Are you defining software rendering as anything non-fixed-function?

The first pixel shader model was hardly anything more than a language describing which texture stages to enable, and what operations they should perform.

Out of interest, I get 1400 fps in ChameleonMark.
 
Sorry Nick, but when it comes to the return of software rendering, Sweeney lost his credibility a long time ago.

I agree that GPUs will become more and more programmable, but we are still far away from dropping pipeline-oriented rendering APIs in favor of a general multithreaded approach for many-core processors. Most developers won't go back to the stone age of rendering where you need to do everything on your own. We need higher abstraction levels, not lower ones.
 
If future processors dedicate most of their area to narrow-issue SIMD or VLIW cores, then GPUs will have won.

If future processors dedicate most of their area to wide-issue superscalar cores, then CPUs will have won.
 
Sorry Nick, but when it comes to the return of software rendering, Sweeney lost his credibility a long time ago.

I agree that GPUs will become more and more programmable, but we are still far away from dropping pipeline-oriented rendering APIs in favor of a general multithreaded approach for many-core processors. Most developers won't go back to the stone age of rendering where you need to do everything on your own. We need higher abstraction levels, not lower ones.

As Sweeney explains in his own paper, "modern" technology is a far cry from what is required to realize his vision. In 2012, even a modest resolution of 1920x1200 requires far more bandwidth than anything currently on offer: 4 Tbps, which he wants to share between CPU and GPU. He simply didn't realize what his words of ten years ago would mean today.
Please show us that much-vaunted REYES renderer running at 320x200@60Hz. Upscale it and compare it to Gears of War.
 
I agree that GPUs will become more and more programmable, but we are still far away from dropping pipeline-oriented rendering APIs in favor of a general multithreaded approach for many-core processors. Most developers won't go back to the stone age of rendering where you need to do everything on your own. We need higher abstraction levels, not lower ones.

Why do we need rendering APIs though? Don't they pretty much just define data formats and establish required hardware functionality? The rigid software pipeline is really just an artifact of limited hardware. As this hardware becomes faster and more programmable why would we need arbitrary software restrictions imposed by an API?
 
Carmack gave a significant amount of credit to OpenGL and DirectX for allowing highly concurrent programming to become commercially successful in the consumer market.

The beauty of it was that it did so before "moar corez" was the new hotness, and few even knew it.

As for why APIs do not die readily in the face of the manycore advance, let's look at how far we've advanced in core utilization in the years since multi-core CPUs became available to consumers.
Aside from apps with an inherent affinity for it, multithreaded applications generally don't scale beyond a few cores, and fall far short of many.

APIs are restrictive, but at the same time the success of the supposedly free-form ideal has been rather narrow.

APIs will take time to die in part because the hardware that originated under fixed-function regimes is not yet capable of full generality, and because programmers are seriously challenged by fully general concurrency and all the pitfalls it introduces. APIs instead are very carefully adding general concepts, perhaps in the hope of adding enough flexibility without reintroducing the same problems that many-core programmers using standard programming languages and practices hit.

This might be why there are calls for purely functional languages to replace the poorly encapsulated, side-effect-ridden languages in general use, but that is a whole other category of things people have been saying will be dead in 10 years for decades.
 
The fact is that MSAA only became useful for all kinds of rendering methods from D3D10.1 onwards, when both the hardware and the API let the developer get at all of the MSAA sample data and use it as they please.

But if you try hard enough you can obviously still break it - or at least introduce a noteworthy degree of trouble, as in DICE's DX11 CS Deferred Rendering Experiment. (Link)
 
As for why APIs do not die readily in the face of the manycore advance, let's look at how far we've advanced in core utilization in the years since multi-core CPUs became available to consumers.

Isn't that more a function of hard-to-parallelize tasks and limited hardware?

APIs will take time to die in part because the hardware that originated under fixed-function regimes is not yet capable of full generality, and because programmers are seriously challenged by fully general concurrency and all the pitfalls it introduces.

Well I'm not saying we should just strip away that level of abstraction, just that we should move it around. There's no reason for an engine designer to be hobbled by the restrictions of DirectX and OpenGL. These guys should be able to make full use of whatever techniques they deem most effective for lighting a pixel. The folks building their applications on top of that engine will still benefit from the same level of abstraction and ease of use as they do today. So instead of Microsoft, Nvidia and ATI handling that abstraction layer, leave it up to id, Epic, Crytek et al.
 
But if you try hard enough you can obviously still break it - or at least introduce a noteworthy degree of trouble, as in DICE's DX11 CS Deferred Rendering Experiment. (Link)
Are you referring to the point "Can't output to MSAA texture" on slide 31?

I'm not sure if he's saying that he can't write a colour/Z buffer pair with CS and then have the hardware do an automatic MSAA resolve, or if he means that CS isn't able to use the MSAA hardware to accelerate MSAA writes.

MSAA resolve isn't a meaningful bottleneck - and if you're doing any kind of HDR, then you don't want an automatic resolve. If he's referring to the inability to gain acceleration, well, CS isn't working on GPU-rasterised fragments, so that's not exactly surprising.
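
To illustrate the HDR point with a toy example (made-up numbers, not anyone's actual resolve code): averaging HDR samples and then tone-mapping is not the same as tone-mapping each sample and then averaging, which is exactly why you want programmable access to the samples rather than the automatic resolve.

#include <cstdio>

// Toy Reinhard-style tone-map operator.
static float tonemap(float x) { return x / (1.0f + x); }

int main() {
    // One pixel with 4 HDR samples: a very bright sample next to three dark ones
    // (e.g. a thin bright edge covering only one sample).
    const float samples[4] = {16.0f, 0.1f, 0.1f, 0.1f};

    // Automatic resolve: average in linear HDR space, then tone-map once.
    float avg = 0.0f;
    for (float s : samples) avg += s;
    avg /= 4.0f;
    float resolve_then_map = tonemap(avg);

    // Custom resolve: tone-map each sample, then average the display-space values.
    float map_then_resolve = 0.0f;
    for (float s : samples) map_then_resolve += tonemap(s);
    map_then_resolve /= 4.0f;

    std::printf("average then tonemap: %.3f\n", resolve_then_map);   // ~0.803, edge stays blown out
    std::printf("tonemap then average: %.3f\n", map_then_resolve);   // ~0.303, edge is actually antialiased
}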

Overall, I can't tell what your point is (or what his point is). Maybe you'd like to quiz him about what he actually means?

Jawed
 
Why do we need rendering APIs though? Don't they pretty much just define data formats and establish required hardware functionality? The rigid software pipeline is really just an artifact of limited hardware. As this hardware becomes faster and more programmable why would we need arbitrary software restrictions imposed by an API?

The keyword is domain-specific systems. We should stop telling computers how they should do something. Instead we should tell them what they should do.

A good example where domain-specific systems are already fully accepted is databases. A database programmer uses SQL to describe the data that's needed. The advantage is that these queries can profit from multicore systems and other optimizations without the need to rewrite them.

On the other hand we have games that were written to the bare metal. Unfortunately, not long ago that bare metal was a single-core processor. Therefore there was no profit from moving to dual core without a rewrite. To make the story even worse, many developers haven't learned the lesson and now optimize specifically for dual core. Again, no profit from moving to even more cores.
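
As a rough C++17 analogue of the "say what, not how" point (purely illustrative): a parallel algorithm only describes the transformation and leaves the scheduling to the runtime, so the same code keeps profiting as core counts grow, the way a SQL query does.

#include <algorithm>
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<float> intensities(1 << 20);
    std::iota(intensities.begin(), intensities.end(), 0.0f);

    // "What": square every element. "How" (thread count, chunking, SIMD) is
    // left to the implementation, so 2, 4 or 64 cores need no rewrite.
    std::transform(std::execution::par_unseq,
                   intensities.begin(), intensities.end(),
                   intensities.begin(),
                   [](float x) { return x * x; });

    std::cout << intensities[3] << '\n';  // 9
}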

If we translate this scenario to a massively many-core add-on chip (like a GPU) that could be programmed close to the metal, we will face the same situation. Programmers will optimize and add workarounds for a specific hardware generation. This causes two problems. The first is that software may not profit from future hardware improvements without a rewrite. The second is even more critical: future hardware needs to be compatible with older hardware at a very deep level. This will get worse with every new hardware generation.

If we use a rendering API as the interface, the hardware can evolve more easily. It's also possible to add support for new features to old games, like AA and AF or, even more complicated, the "real" 3D support that Nvidia is currently pushing.
 
Are you referring to the point "Can't output to MSAA texture" on slide 31?

I'm not sure if he's saying that he can't write a colour/Z buffer pair with CS and then have the hardware do an automatic MSAA resolve, or if he means that CS isn't able to use the MSAA hardware to accelerate MSAA writes.

MSAA resolve isn't a meaningful bottleneck - and if you're doing any kind of HDR, then you don't want an automatic resolve. If he's referring to the inability to gain acceleration, well, CS isn't working on GPU-rasterised fragments, so that's not exactly surprising.

Overall, I can't tell what your point is (or what his point is). Maybe you'd like to quiz him about what he actually means?

Jawed

The compute shader can't write to a target that is multisampled. Therefore the result of a compute shader pass will always lose the multisampling information. This means you need to implement the downsample as part of your compute shader, or downsample any sources beforehand (a bad idea).
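
For what it's worth, here is a minimal CPU-side sketch of what folding the downsample into the end of such a pass looks like (the layout and names are invented for illustration, not taken from any actual engine): the supersampled data lives in a plain buffer of width x height x S values and the last step box-filters it down to one sample per pixel.

#include <cstdio>
#include <vector>

// Supersampled colour buffer: S samples per pixel, stored contiguously per pixel.
struct SuperBuffer {
    int width, height, samples;
    std::vector<float> data;                 // size = width * height * samples
    float& at(int x, int y, int s) { return data[(y * width + x) * samples + s]; }
};

// Last step of the "compute pass": box-filter the samples of each pixel
// down into an ordinary 1-sample-per-pixel target.
std::vector<float> resolve(SuperBuffer& src) {
    std::vector<float> dst(src.width * src.height);
    for (int y = 0; y < src.height; ++y)
        for (int x = 0; x < src.width; ++x) {
            float sum = 0.0f;
            for (int s = 0; s < src.samples; ++s)
                sum += src.at(x, y, s);
            dst[y * src.width + x] = sum / src.samples;
        }
    return dst;
}

int main() {
    SuperBuffer sb{2, 1, 4, std::vector<float>(2 * 1 * 4, 0.0f)};
    for (int s = 0; s < 4; ++s) sb.at(1, 0, s) = (s < 2) ? 1.0f : 0.0f;  // half-covered pixel
    auto out = resolve(sb);
    std::printf("%.2f %.2f\n", out[0], out[1]);   // 0.00 0.50
}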
 
Isn't that more a function of hard-to-parallelize tasks and limited hardware?
A task can be hard to make parallel for various reasons.
Some are intractably serial.
Others have parallel implementations that are hard to code and validate.

Most research points to a decent amount of available parallelism across the board, but the software world has not caught up.

The big hurt for parallel scaling isn't expected until the "manycore" era, so the dual and quad solutions of contemporary designs represent the best opportunity for multi-core coding, and the opportunity thus far has been squandered.

Well I'm not saying we should just strip away that level of abstraction, just that we should move it around. There's no reason for an engine designer to be hobbled by the restrictions of DirectX and OpenGL.
DirectX and OpenGL would be what the bulk of the graphics programming world knows and has invested quite a lot into.
That amount of inertia does not turn on a dime because of what engine designers want.

These guys should be able to make full use of whatever techniques they deem most effective for lighting a pixel.
What they deem most effective would depend on their criteria.
At least from the point of view of performance, the hardware they'd be targeting would itself be targeting the dominant graphics workloads: those running on the dominant APIs.

The folks building their applications on top of that engine will still benefit from the same level of abstraction and ease of use as they do today. So instead of Microsoft, Nvidia and ATI handling that abstraction layer, leave it up to id, Epic, Crytek et al.
For the rest of the software world, the value of standards would make it difficult for a bunch of fractious game engine designers to provide a unified front.
The competence of each small player in defining a robust API is also in question.
It took MS quite a few tries, for example.
 
The compute shader can't write to a target that is multisampled. Therefore the result of a compute shader pass will always lose the multisampling information. This means you need to implement the downsample as part of your compute shader, or downsample any sources beforehand (a bad idea).
Is there a disadvantage to performing the downsample of MSAA data as part of CS?

What kind of rendering configuration would make the ability of CS to write/re-write MSAA render targets very useful, in comparison with what D3D11 currently allows?

As far as I can tell, the CS pass can output multisample-frequency data; it just can't write it to buffers that the GPU understands as MSAA buffers. CS could output "64x MSAA" if it wanted, but the hardware could do nothing but treat the colour and Z portions as 2D resources. If the CS pass writes a new buffer, the existing Z buffer isn't destroyed.

So any further passes, e.g. shadows and transparency/particle-effects, can still access the original Z buffer at the original MSAA frequency. A final compositing CS should then be able to bind all these buffers together into a final picture, with a tone-map and downsample.

Filling Z, originally, as triangles are rasterised, is where the MSAA expense lies. I don't understand where the high-cost/misfortune arises when CS is unable to generate/re-write MSAA buffers (i.e. the special combination of 1-8 colour targets and Z).

CS, by its very nature, is going to be "slower" than pixel shading when writing 2D buffers. All the techniques that pixel-shader writes benefit from (colour compression, tiling alignments, etc.) are nominally "lost". Though I suppose it's possible for clever drivers to tune the way memory is handled, at least.

Jawed
 
Is there a disadvantage to performing the downsample of MSAA data as part of CS?

Not a real one. You just need to build a separate shader for every sample count you want to support, but you can do this from a single source file. It works fine for us.
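
In C++ terms, the "one source, one shader per sample count" approach looks roughly like a compile-time sample-count parameter (purely illustrative; a real shader would use a #define per variant rather than a template).

#include <array>
#include <cstdio>

// One source, specialised per sample count at compile time so the inner loop
// can be fully unrolled, much like building one shader per MSAA level.
template <int SAMPLES>
float resolve_pixel(const std::array<float, SAMPLES>& samples) {
    float sum = 0.0f;
    for (int s = 0; s < SAMPLES; ++s) sum += samples[s];
    return sum / SAMPLES;
}

int main() {
    std::printf("%.2f\n", resolve_pixel<2>({1.0f, 0.0f}));              // 0.50
    std::printf("%.2f\n", resolve_pixel<4>({1.0f, 1.0f, 0.0f, 0.0f}));  // 0.50
}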

What kind of rendering configuration would make the ability of CS to write/re-write MSAA render targets very useful, in comparison with what D3D11 currently allows?

It could be useful from a performance point of view if you could invoke the compute shader based on the coverage information. In that case you would not calculate the same value for each identical sample.
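
A toy sketch of that idea (the per-sample data layout here is invented; D3D doesn't actually expose it this way): if the coverage information tells you which samples were written by the same primitive, you shade once per distinct primitive and broadcast, instead of shading every sample.

#include <array>
#include <cstdio>

// Imaginary per-pixel input: 4 samples, plus an id telling which primitive
// covered each sample. Samples with the same id would shade identically.
struct PixelSamples {
    std::array<float, 4> depth;
    std::array<int, 4>   primitive_id;
};

// Stand-in for an expensive per-sample shading function.
static float shade(float depth) { return 1.0f - depth; }

// Shade each distinct primitive id once, then broadcast to its samples.
std::array<float, 4> shade_pixel(const PixelSamples& px, int* invocations) {
    std::array<float, 4> colour{};
    std::array<bool, 4> done{};
    for (int s = 0; s < 4; ++s) {
        if (done[s]) continue;
        float c = shade(px.depth[s]);
        ++*invocations;
        for (int t = s; t < 4; ++t)
            if (px.primitive_id[t] == px.primitive_id[s]) { colour[t] = c; done[t] = true; }
    }
    return colour;
}

int main() {
    PixelSamples px{{0.2f, 0.2f, 0.2f, 0.7f}, {7, 7, 7, 9}};   // edge pixel: two primitives
    int invocations = 0;
    auto c = shade_pixel(px, &invocations);
    std::printf("shaded %d times instead of 4: %.1f %.1f %.1f %.1f\n",
                invocations, c[0], c[1], c[2], c[3]);          // shaded 2 times: 0.8 0.8 0.8 0.3
}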

As far as I can tell, the CS pass can output multisample-frequency data; it just can't write it to buffers that the GPU understands as MSAA buffers. CS could output "64x MSAA" if it wanted, but the hardware could do nothing but treat the colour and Z portions as 2D resources. If the CS pass writes a new buffer, the existing Z buffer isn't destroyed.

Sure, you can save it as a supersampling buffer.

So any further passes, e.g. shadows and transparency/particle-effects, can still access the original Z buffer at the original MSAA frequency. A final compositing CS should then be able to bind all these buffers together into a final picture, with a tone-map and downsample.

Filling Z, originally, as triangles are rasterised, is where the MSAA expense lies. I don't understand where the high-cost/misfortune arises when CS is unable to generate/re-write MSAA buffers (i.e. the special combination of 1-8 colour targets and Z).

I never said that it would cause high costs. It would just be a somewhat more complete solution.

CS, by its very nature, is going to be "slower" than pixel shading when writing 2D buffers. All the techniques that pixel-shader writes benefit from (colour compression, tiling alignments, etc.) are nominally "lost". Though I suppose it's possible for clever drivers to tune the way memory is handled, at least.

Jawed

Rectangular thread groups can help to ensure that the target tiles are calculated together. I expect that the command processors that invoke the compute shaders will support different patterns anyway, so drivers can tune the pattern by analyzing the shader or just keeping lists of known shaders.
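
For the tiling point, a small sketch with made-up numbers: mapping a flat group index to rectangular 8x8 tiles keeps neighbouring groups inside neighbouring memory tiles, which is the kind of pattern a driver could pick per shader.

#include <cstdio>

// Map a flat dispatch index to 8x8 pixel tiles laid out left-to-right,
// top-to-bottom, so consecutive groups stay inside nearby memory tiles.
struct Tile { int x0, y0; };

Tile tile_for_group(int group_index, int image_width) {
    const int tile    = 8;
    const int tiles_x = (image_width + tile - 1) / tile;
    return { (group_index % tiles_x) * tile, (group_index / tiles_x) * tile };
}

int main() {
    // 1920-wide image -> 240 tiles per row; group 241 starts the second row, one tile in.
    Tile t = tile_for_group(241, 1920);
    std::printf("group 241 covers pixels (%d..%d, %d..%d)\n", t.x0, t.x0 + 7, t.y0, t.y0 + 7);
    // prints: group 241 covers pixels (8..15, 8..15)
}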
 
The fact is that MSAA only became useful for all kinds of rendering methods from D3D10.1 onwards, when both the hardware and the API let the developer get at all of the MSAA sample data and use it as they please. Not sure which version of OpenGL allows equivalent access...

OpenGL 3.2 / GL_ARB_texture_multisample

Direct3D 10 and 11 offer nothing new compared to Direct3D 9.

wut?

Why do we need rendering APIs though?

Productivity, vendor independence and forward compatibility.
 
If we translate this scenario to a massively many-core add-on chip (like a GPU) that could be programmed close to the metal, we will face the same situation. Programmers will optimize and add workarounds for a specific hardware generation. This causes two problems. The first is that software may not profit from future hardware improvements without a rewrite. The second is even more critical: future hardware needs to be compatible with older hardware at a very deep level. This will get worse with every new hardware generation.

But that's no different than Nvidia or ATI doing those same things via their drivers for each hardware generation. And doesn't backward compatibility become less and less of an issue as hardware becomes more programmable? At some point graphics hardware will support just a few primitive types and a few operations just like x86 today. Isn't stuff like CUDA a big step in that direction?

DirectX and OpenGL would be what the bulk of the graphics programming world knows and has invested quite a lot into. That amount of inertia does not turn on a dime because of what engine designers want.

Agreed, but there will be some overlap when one day somebody decides to write an engine in OpenCL and bypass the graphics APIs completely. Depending on the success of that effort we could see it become the standard approach after a while.

What they deem most effective would depend on their criteria.
At least from the point of view of performance, the hardware they'd be targeting would itself be targeting the dominant graphics workloads: those running on the dominant APIs.

Not sure I agree with that; it seems like hardware is becoming less attached to those APIs with every passing generation. Take Larrabee, for example.

Productivity, vendor independence and forward compatibility.

Why aren't any of those things a problem on x86? Productivity is there through code reuse, vendor independence is there through standardization of the basic instruction set, and forward compatibility is not an issue because your starting point is general programmability. Sure, x86 is standardized at the hardware level, but I see no reason why we can't have that sort of standardization at the programming-interface level (CUDA/OpenCL) instead of restrictive high-level graphics APIs.
 
Why aren't any of those things a problem on x86?

Err, what makes you think this is true? When Humus mentions "forward compatibility", I think he means forward compatibility with reasonable performance. Larrabee supports x86 (to a degree), but how well do you think it will run x86 code written a decade ago? Conversely, my R700 runs Half-Life (a decade-old game) very well.
 
Agreed, but there will be some overlap when one day somebody decides to write an engine in OpenCL and bypass the graphics APIs completely. Depending on the success of that effort we could see it become the standard approach after a while.
At least at the outset, it appears that the big push is to use CS as a graphics pipeline adjunct, not as a standalone.
That might indicate that going purely CS bypasses a lot of on-chip resources and might leave an OpenCL renderer gimped in comparison.

Even when such an inflection point comes, the idea of writing a low-level rendering engine would probably scare the pants off 90% of the developers out there who derive little value from such work and don't want to pay another company like Epic for the privilege of using a graphics pipeline. Here, the framework and abstractions the APIs provide for free do well enough.

Not sure I agree with that; it seems like hardware is becoming less attached to those APIs with every passing generation. Take Larrabee, for example.
Well, to be fair, it is rather hard to be attached to an API when the hardware predates the API (other than the TMUs, perhaps).
 
At least at the outset, it appears that the big push is to use CS as a graphics pipeline adjunct, not as a standalone.
That might indicate that going purely CS bypasses a lot of on-chip resources and might leave an OpenCL renderer gimped in comparison.

The reality is that the only things you miss out on via CS/OCL atm are the texture sampler interfaces and the ROP functionality.

ROPs, I think, will go away eventually. I think both CS and OCL will eventually get support for the full texture sampler functionality, effectively killing SM/OGL X.X from that point forward (i.e., SM and OGL spec development will die).


Even when such an inflection point comes, the idea of writing a low-level rendering engine would probably scare the pants off 90% of the developers out there who derive little value from such work and don't want to pay another company like Epic for the privilege of using a graphics pipeline. Here, the framework and abstractions the APIs provide for free do well enough.

The reality is, how many people actually write their own engines anymore? And while people might not want to pay Epic for their engine, they are most likely just paying some other company for theirs.

The engine-per-game era died a long time back, replaced by the licensed-engine era.

Given an existing CS/OCL/Metal engine and source, it is no harder to modify it than it is to roll your own engine using D3D X.X.
 