The End of The GPU Roadmap

My R700 runs Half-Life (a decade-old game) very well.

And an i7 runs code that used to run on a 386.

At least at the outset, it appears that the big push is to use CS as a graphics pipeline adjunct, not as a standalone. That might indicate that going purely CS bypasses a lot of on-chip resources and might leave an OpenCL renderer gimped in comparison.

That's pretty short term. The only reason it bypasses a lot of on-chip resources is because those resources are fixed function. And there will be fewer and fewer of those in the future.

Even when such an inflection point comes, the idea of writing a low-level rendering engine would probably scare the pants off 90% of the developers out there who derive little value from such work and don't want to pay another company like Epic for the privilege of using a graphics pipeline. Here, the framework and abstractions the APIs provide for free do well enough.
So we're going to leave it to Microsoft, AMD, Nvidia and Intel to define and implement every possible rendering technique that becomes feasible on future hardware? The reason it's doable now is that the set of feasible approaches is tiny. As that set grows it's highly improbable that people would be happy with whatever canned implementations those guys come up with.

Well, to be fair, it is rather hard to be attached to an API when the hardware predates the API (other than the TMUs, perhaps).
What aspect of DX11 does Larrabee not support?
 
And an i7 runs code that used to run on a 386.

Stop being petty. If one is trying to get decent performance out of any modern Intel CPU, they're using one of the SSE instruction sets. Limiting the i7 to the instruction set found on the i386 significantly limits performance. Half-Life does nothing that limits performance on my R700.
 
Larrabee supports x86 (to a degree), but how well do you think it will run x86 code written a decade ago?

Larrabee is an x86 CPU (sort of), which is not meant to run any code not written for the massively parallel era. Ergo, code written a decade ago, for x86 or otherwise, doesn't count for anything, except for marketing weasels.
 
Stop being petty. If one is trying to get decent performance out of any modern Intel CPU, they're using one of the SSE instruction sets. Limiting the i7 to the instruction set found on the i386 significantly limits performance. Half-Life does nothing that limits performance on my R700.

He's not being petty. As for who is being petty, a mirror might help you with that.

The i7 will run everything that ran on the 386 and run it incredibly fast. And no, you don't need to use SSE instructions to get decent performance on x86. I'd wager that the vast majority of applications written both before and after the introduction of SSE don't use SSE and still get good performance. In addition, I'd wager that the vast majority of applications developed over the past 5 years don't use anything much beyond the 32-bit x86 instruction set. 64-bit programs are still in the vast minority, and the same goes for all the other extensions.

As for your final sentence, how do you know?
 
Larrabee is an x86 CPU (sort of), which is not meant to run any code not written for the massively parallel era. Ergo, code written a decade ago, for x86 or otherwise, doesn't count for anything, except for marketing weasels.

Actually, I'd wager it runs x86 code written a decade ago as well as, if not better than, any hardware that was available a decade ago.

We're talking about a best case of a 700 MHz K7 or a 600 MHz P3.
 
Actually, I'd wager it runs x86 code written a decade ago as well as, if not better than, any hardware that was available a decade ago.

We're talking about a best case of a 700 MHz K7 or a 600 MHz P3.
Try running Windows 95 and see what happens :D

Edit: Okay okay you did say a decade ago, so try Windows Millennium ;)
 
I agree that GPUs will become more and more programmable, but we are still far away from dropping pipeline-oriented rendering APIs in favor of a general multithreaded approach for many-core processors. Most developers won't go back to the stone age of rendering where you need to do everything on your own. We need higher abstraction levels, not lower ones.
They won't need to do anything on their own. But they could if they wanted to.

Is desktop application development in the stone age because CPUs are fully generic? No, we just have many different levels of abstraction to cope with that complexity. The majority of developers use a language like C# that does several things automatically and comes with a framework library with tons of standard functionality. Performance-oriented applications are often written in C++ to avoid certain overhead, but there are still numerous libraries readily available. The libraries themselves are typically written in C for optimal control. And several functions within the libraries are written in assembly for ultimate performance.

Right now graphics APIs are your only choice when trying to access the GPU. It's a level of abstraction above C#. But we have a level of abstraction above the API as well: the engines. You don't necessarily have to reinvent the wheel if you want certain graphics effects. But because there's no level below the API we're severely limited in the layers above it as well...

Most developers will continue to use engines and APIs. Direct3D will just be a library running on the GPU. But next to that library we can have many others as well. We can have a raytracing library, a REYES library, a physics library, a volumetric rendering library, an A.I. library, a video processing library, etc. And there won't be just one choice of library; there will be many, with different feature sets and prices, or you can roll your own.

So I don't think Sweeney is trying to forecast the end of the API, nor of the GPU. He's merely forecasting the end of the classic meaning of the API as the only means to communicate with the hardware. The majority of developers will still use an API, but it will no longer dominate what you can do. And in that sense the API as we know it today will cease to exist. Likewise, GPUs are changing radically from obeying the API's pipeline architecture to becoming fully programmable and capable of implementing any API or library.

"We need great tools: compilers, engines, middleware libraries..."
 
Actually, I'd wager it runs x86 code written a decade ago as well as, if not better than, any hardware that was available a decade ago.

We're talking about a best case of a 700 MHz K7 or a 600 MHz P3.

It may be able to compete with them based on its sheer clock speed. But that doesn't mean it is meant to run PowerPoint. Sure, there will be an oddball demo running win/lin/bsd there, but that doesn't mean it is meant to do that.
 
Right now graphics APIs are your only choice when trying to access the GPU. It's a level of abstraction above C#. But we have a level of abstraction above the API as well: the engines. You don't necessarily have to reinvent the wheel if you want certain graphics effects. But because there's no level below the API we're severely limited in the layers above it as well...

Most developers will continue to use engines and APIs. Direct3D will just be a library running on the GPU. But next to that library we can have many others as well. We can have a raytracing library, a REYES library, a physics library, a volumetric rendering library, an A.I. library, a video processing library, etc. And there won't be just one choice of library; there will be many, with different feature sets and prices, or you can roll your own.

So I don't think Sweeney is trying to forecast the end of the API, nor of the GPU. He's merely forecasting the end of the classic meaning of the API as the only means to communicate with the hardware. The majority of developers will still use an API, but it will no longer dominate what you can do. And in that sense the API as we know it today will cease to exist. Likewise, GPUs are changing radically from obeying the API's pipeline architecture to becoming fully programmable and capable of implementing any API or library.
API by definition means "application programming interface"; it doesn't mean Direct3D, OpenGL or even anything remotely connected to graphics (say, the WinSock API). OpenCL and CUDA are still APIs and are even on the same level as Direct3D or OpenGL. Both provide abstract access to hardware the same way D3D or OpenGL do. The difference, however, is that OpenCL/CUDA are compute centric and not graphics centric. It's not that Direct3D is in any way on top of CUDA, for example.

Now we also have D3D 11, which provides these same compute capabilities. So how is D3D 11 any less capable than OpenCL/CUDA? You can still use rasterisation if you want, or you can program whatever the hell you want on top of compute.
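To put that in concrete terms, the host side of a compute dispatch is tiny. A rough C++ sketch only, assuming the device, context, compiled CS bytecode and output UAV already exist (the function name and thread-group numbers are just placeholders):

    #include <d3d11.h>

    // Minimal host-side compute dispatch: bind the shader and a UAV, then
    // launch enough 16x16 thread groups to cover a 1080p target. The HLSL
    // side is assumed to declare [numthreads(16, 16, 1)].
    void RunComputePass(ID3D11Device* device, ID3D11DeviceContext* ctx,
                        const void* csBytecode, SIZE_T csSize,
                        ID3D11UnorderedAccessView* outputUAV)
    {
        ID3D11ComputeShader* cs = nullptr;
        device->CreateComputeShader(csBytecode, csSize, nullptr, &cs);

        ctx->CSSetShader(cs, nullptr, 0);
        ctx->CSSetUnorderedAccessViews(0, 1, &outputUAV, nullptr);
        ctx->Dispatch((1920 + 15) / 16, (1080 + 15) / 16, 1);

        cs->Release();
    }

What that shader actually computes is entirely up to you, which is the point: rasterisation is one option, not the only one.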

Sure, if NV and ATI decide to drop stuff like triangle setup and new tessellation stages from their GPUs and become "Larrabee", then there's little point for them to provide a D3D implementation except for D3D compute.
You could then have third parties that would develop rasterisation libraries, raytracing libraries, video codec libraries, sound libraries,... And your fourth parties that would develop and sell game engines such as Epic.

But how is this different to the situation we have today? Nobody is forcing you to use D3D or OpenGL. You can develop your render library using x86, SSE, CUDA, OpenCL,... and sell it to whoever is interested. It's hard to say that D3D as such will die, because D3D as such has died a couple of times already and has been ripped up and rebuilt over the years. If anything, the current "DrawPrimitive*" will be moved to D3DX...

The point I'm trying to make is that people that have time to develop their raster/raytracing/whatever libraries from scratch can do so today. These people probably work at Epic, ID and Gamebryo. These libraries won't see the light of day as independent libraries, but will be packed somewhere deep in their complete game engines and will be sold as such.

It's actually interesting that successful third parties developing what you could call "low level middleware" such as physics engines get bought up by bigger players anyway. Havok got bought by Intel, Ageia got bought by NVidia, Pixomatic guys got bought by Intel, Project Offset got bought by Intel, developer People Can Fly got acquired by Epic,...
 
API by definition means "application programming interface"; it doesn't mean Direct3D, OpenGL or even anything remotely connected to graphics (say, the WinSock API). OpenCL and CUDA are still APIs and are even on the same level as Direct3D or OpenGL. Both provide abstract access to hardware the same way D3D or OpenGL do. The difference, however, is that OpenCL/CUDA are compute centric and not graphics centric. It's not that Direct3D is in any way on top of CUDA, for example.
I know what an API is. I was mainly talking about them in the context of graphics though. The problem is that there is currently no way around using the graphics APIs implemented by the IHVs. You can implement things on top of OpenCL or CUDA, but performance will be far from ideal. In many ways, they are just byproducts of the Direct3D pipeline.

The problem is not so much the software side, but the hardware. It is incapable of efficiently supporting anything that deviates significantly from the Direct3D pipeline. Today's hardware still dedicates massive amounts of silicon to ROPs, texture samplers, rasterization, etc. That's pretty useless for anything other than Direct3D. Furthermore, the available register space is tiny. It's fine for processing vertices and pixels, but any algorithm that involves deep function calls is out of the question. Lastly, current GPGPU APIs are incapable of letting tasks spawn tasks. There's always a round trip to the CPU. So fine-grained control is not possible either. And GPUs have to become capable of compiling their own code as well before we can break free from all API restrictions.
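To make the round-trip point concrete, here's roughly what any algorithm with dependent passes looks like on the host today. This is just a sketch in C++ against the OpenCL C API; the queue and kernel names are made up, and setup/error handling is omitted:

    #include <CL/cl.h>

    // The CPU has to act as the scheduler: a kernel can't enqueue follow-up
    // work itself, so every dependent pass goes back through the host queue.
    void RunDependentPasses(cl_command_queue queue,
                            cl_kernel traverse, cl_kernel shade,
                            size_t workItems, int passes)
    {
        for (int i = 0; i < passes; ++i)
        {
            clEnqueueNDRangeKernel(queue, traverse, 1, NULL,
                                   &workItems, NULL, 0, NULL, NULL);
            clFinish(queue);  // round trip: the CPU waits, then decides what to run next

            clEnqueueNDRangeKernel(queue, shade, 1, NULL,
                                   &workItems, NULL, 0, NULL, NULL);
            clFinish(queue);  // and again
        }
    }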
Now we also have D3D 11, which provides these same compute capabilities. So how is D3D 11 any less capable than OpenCL/CUDA? You can still use rasterisation if you want, or you can program whatever the hell you want on top of compute.
...as long as it's no more complex than vertex or pixel processing.
Sure, if NV and ATI decide to drop stuff like triangle setup and new tessellation stages from their GPUs and become "Larrabee", then there's little point for them to provide a D3D implementation except for D3D compute.
You could then have third parties that would develop rasterisation libraries, raytracing libraries, video codec libraries, sound libraries,... And your fourth parties that would develop and sell game engines such as Epic.

But how is this different to the situation we have today? Nobody is forcing you to use D3D or OpenGL. You can develop your render library using x86, SSE, CUDA, OpenCL,... and sell it to whoever is interested.
It's different because when you start to deviate from the Direct3D pipeline on today's GPUs, performance plummets. Back when GPGPU was all the hype, they compared performance to a Pentium 4. Occasionally they had a 3x higher performance. Nowadays we compare it to a Core i7, and the only GPGPU applications that managed to survive are those that are highly graphics related.

An architecture like Larrabee won't have that problem. It combines the throughput of a GPU with the ultimate programmability of a CPU. What developers want isn't super-fast tessellation and everything else running like a snail. They want a wide variety of algorithms to run at predictable performance without having to understand the hardware details.
The point I'm trying to make is that people that have time to develop their raster/raytracing/whatever libraries from scratch can do so today.
Sure. If they're happy with abysmal performance and are willing to spend months trying to fit an algorithm to the GPU architecture...

Direct3D is making the same mistakes again. While something like OpenCL actually has a thin shell for managing data buffers and kernels, with most of the action taking place in the CL programming language, Direct3D is getting fatter again. In the early days it added ever more fixed-function features. Now they've ditched all that, but instead they're adding ever more pipeline stages. Soon they'll realize it's not worth the overhead and they'll have to ditch all that in favor of fully programmable graphics pipelines, eventually evolving into software rendering.
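For comparison, this is more or less the entire host-side shell OpenCL asks for: a buffer, a program, a kernel, its arguments, and a launch. Everything interesting lives in the CL source. A sketch only - the kernel name is made up and error handling is left out:

    #include <CL/cl.h>

    // Thin host shell: all the real work is in 'source', which is CL code
    // compiled at runtime. Context/queue/device creation is omitted.
    cl_mem RunKernelOnce(cl_context ctx, cl_command_queue queue,
                         cl_device_id dev, const char* source, size_t count)
    {
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    count * sizeof(float), NULL, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, NULL);
        clBuildProgram(prog, 1, &dev, "", NULL, NULL);

        cl_kernel kernel = clCreateKernel(prog, "my_kernel", NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &count, NULL, 0, NULL, NULL);
        return buf;
    }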
 
The point I'm trying to make is that people that have time to develop their raster/raytracing/whatever libraries from scratch can do so today. These people probably work at Epic, ID and Gamebryo. These libraries won't see the light of day as independent libraries, but will be packed somewhere deep in their complete game engines and will be sold as such.

By your own point, these people probably won't all work at Epic, ID or Crytek just yet (I'm replacing Gamebryo with Crytek here), because by your own following point, innovations are much more likely to come from smaller, more flexible companies. Or perhaps companies with a more clearly vested interest in innovation, like the Larrabee teams. But most of the time it will be small companies - and they can do great things. Slightly left-field from this topic, but I'm still impressed by what a small team like Media Molecule managed to do with LittleBigPlanet.

It's actually interesting that successful third parties developing what you could call "low level middleware" such as physics engines get bought up by bigger players anyway. Havok got bought by Intel, Ageia got bought by NVidia, Pixomatic guys got bought by Intel, Project Offset got bought by Intel, developer People Can Fly got acquired by Epic,...

Heck, even I'm currently thinking about how I would design an engine 'fresh', and I've got just one tiny little insignificant OpenGL test project under my belt. There must be hundreds of others doing the same thing, and it will increase manifold if stuff like LRB becomes available to home programmers, just because they can.
 
ROPs, I think, will go away eventually. I think both CS and OCL will eventually get support for the full texture sampler functionality, effectively killing SM/OGL X.X from that point forward (i.e., SM and OGL spec development will die).
I think SM and OGL might live longer than that. Their presence as long-time standards will take time to dissipate. The leading edge of developers might hop off, but the rest of the market will want to keep building on their established code bases.

The engine-per-game era died a long time ago, replaced by the licensed-engine era.

Given an existing CS/OCL/Metal engine and source, it is no harder to modify it than it is to roll your own engine using D3D X.X.
It's just as easy to modify an engine using a proprietary to-the-metal implementation as it is to modify an engine using a universally known and widely accepted standard?

I might not be able to find someone who has fiddled with EPIC Engine6, whereas there is a larger pool of people and likely much better documentation associated with DirectX13.

I recall there was a spate of problems with the initial roll-out of the UE and the engine-maker's possible conflicts of interest in supporting its own game rollout versus supporting the customers it essentially competed with.

Also:

Better than that, it can still run code that ran on an 8088 and 8086! That's 25+ years!
I don't think Intel gambled on there being complete compatibility with chips one generation removed, ISA compatibility or not.
An i7 is so removed that there are likely hardware behaviors, bugs, and changed specifications in the way.

As to Larrabee:
Actually, I'd wager it runs x86 code written a decade ago as well as, if not better than, any hardware that was available a decade ago.

We're talking about a best case of a 700 MHz K7 or a 600 MHz P3.
I just don't know about this one. It might be close in a few situations, and a loss in others.
Ten-year-old code is single-threaded code.
One Larrabee core can't run any multimedia extensions, is in-order, has a much narrower issue width (probably 1/3), awful branch prediction, and far fewer execution resources.
Assuming no crashing on instructions that came about after the P54 or P55 cores, this might still be balanced in some cases by an on-die L2 cache of 256KiB and the on-die memory controller and higher bandwidth.


And an i7 runs code that used to run on a 386.
To be somewhat nitpicky, it might run code that used to run on a 386.
There are low-level implementation details that have not stayed constant over the years that may or may not scuttle the attempt, assuming everything else in the system besides the CPU doesn't mess something up.

So we're going to leave it to Microsoft, AMD, Nvidia and Intel to define and implement every possible rendering technique that becomes feasible on future hardware? The reason it's doable now is that the set of feasible approaches is tiny. As that set grows it's highly improbable that people would be happy with whatever canned implementations those guys come up with.
One of the benefits I see of the current system is the amount of steering that Nvidia and ATI/AMD had in determining additions that are feasible in hardware, which balanced nicely with the input software developers wanted and provided a sanity check on the evolution of computer graphics.
I'm not entirely impressed with Sweeney's last decade of prognostications about what hardware was supposed to be able to do.

What aspect of DX11 does Larrabee not support?
To my knowledge, it should support it fully, at least in software.
The texture units are something of an unknown, but any minor case they might miss could fall back to software.


An architecture like Larrabee won't have that problem. It combines the throughput of a GPU with the ultimate programmability of a CPU. What developers want isn't super-fast tessellation and everything else running like a snail. They want a wide variety of algorithms to run at predictable performance without having to understand the hardware details.
Mostly, sort of.
It appears that it will have 1/2 to 1/3 of the throughput of the top GPUs in 2010 with the more robust programmability of a CPU. We'll need to see how things sort out with the actual implementations for sale.
As for the expectation of predictable performance, that would be a revolution given how poorly the crystal ball works on current multicores.
 
I don't think Intel gambled on there being complete compatibility with chips one generation removed, ISA compatibility or not.
An i7 is so removed that there are likely hardware behaviors, bugs, and changed specifications in the way.

I think at worst you would have to search long and hard to find any x86 code that won't run on a modern x86 from either Intel or AMD. The only cases I can think of are ones where it relies on some obscure piece of hardware that isn't actually part of x86.

Backwards compatibility is a BIG deal in x86 land.

One of the benefits I see of the current system is the amount of steering that Nvidia and ATI/AMD had in determining additions that are feasible in hardware, which balanced nicely with the input software developers wanted and provided a sanity check on the evolution of computer graphics.
I'm not entirely impressed with Sweeney's last decade of prognostications about what hardware was supposed to be able to do.

The major detriment to the current system is that it is effectively a mono-culture: if either ATI or Nvidia says it isn't going to do something, it doesn't make it into the spec. The problem is that the hardware and the software are still very much fixed-function.

There are myriad known solutions to a variety of 3D problems that cannot be implemented currently because the hardware is still so fixed-function.
 
They can only give up so much performance in running older games each generation for increased programmability ... they'll get there eventually, patience :)
 
Back when GPGPU was all the hype, they compared performance to a Pentium 4. Occasionally they had a 3x higher performance. Nowadays we compare it to a Core i7, and the only GPGPU applications that managed to survive are those that are highly graphics related.

I don't think that's quite true, unless you think of physical simulation and signal processing as being "graphics related". For work that I'm doing, Teslas are more than an order of magnitude faster than i7s, and I can get a lot more Teslas into a rack, for less money.
 
Are you referring to the point "Can't output to MSAA texture" on slide 31?

I'm not sure if he's saying that he can't write a colour/Z buffer pair with CS and then have the hardware do an automatic MSAA resolve, or if he means that CS isn't able to use MSAA hardware for accelerating MSAA writes?

MSAA resolve isn't a meaningful bottleneck - and if you're doing any kind of HDR, then you don't want an automatic resolve. If he's referring to the inability to gain acceleration, well, CS isn't working on GPU-rasterised fragments, so that's not exactly surprising.
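To spell out why an automatic resolve and HDR don't mix: the tonemap operator is non-linear, so averaging subsamples before tonemapping (which is what a fixed-function resolve does) gives a different result than tonemapping each subsample and then averaging. A toy C++ illustration using a simple Reinhard-style operator - purely for the numbers, not anyone's actual shader code:

    // Two HDR subsamples at a geometry edge: one dark, one very bright.
    float Tonemap(float c) { return c / (1.0f + c); }   // simple Reinhard

    float ResolveThenTonemap(float a, float b)
    {
        return Tonemap((a + b) * 0.5f);                  // {0, 16} -> ~0.89
    }

    float TonemapThenResolve(float a, float b)
    {
        return (Tonemap(a) + Tonemap(b)) * 0.5f;         // {0, 16} -> ~0.47
    }

The second result is the one you want at edges, which is why the resolve ends up being done in the shader rather than by the hardware.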

Overall, I can't tell what your point is (or what his point is). Maybe you'd like to quiz him about what he actually means?

Jawed

A DX11 CS can't write to an MSAA UAV, which is what I listed in my presentation as a bad thing. In practice this isn't that much of a problem, just as you mentioned, since you typically want to do the resolve inside a CS anyway.

But for orthogonality it would still be nice to have support for MSAA UAVs. For example, in my first deferred shading CS implementation I replaced our light accumulation and deferred shading combine pass with a single CS. That CS needs to read MSAA subsamples, which it can; but ideally here (with this specific setup) we would like to output to an MSAA UAV, because after the combine pass we render all forward-rendered opaque (unlit) & transparent surfaces, and those would ideally render in MSAA straight on top of the MSAA deferred shading combined surface.

In that first implementation I just disabled MSAA for those forward surfaces, which wasn't that much of a loss as most of our transparent surfaces are rendered in half-res & upsampled anyway.

A proper alternative implementation for this, to keep MSAA fully in the entire rendering pipeline together with a deferred shading CS, would be, just as someone mentioned here, to render out the forward/transparent surfaces to a separate MSAA render target just after the gbuffers have been rendered (so we have z), and then composite that together with the deferred shading result directly in the CS. It requires more memory but can also allow the CS to act as a final post-processing/tonemapping shader if one wishes.
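For what it's worth, the restriction is visible on the host side too: D3D11's UAV description has no multisampled view dimension at all, so the combine CS can only ever be handed a single-sample surface to write to. A rough C++ sketch, with device and texture creation omitted and the name and format made up:

    #include <d3d11.h>

    // A UAV can only be created over a single-sample texture:
    // D3D11_UAV_DIMENSION has TEXTURE2D but no TEXTURE2DMS equivalent.
    ID3D11UnorderedAccessView* CreateCombineTargetUAV(ID3D11Device* device,
                                                      ID3D11Texture2D* tex)
    {
        D3D11_UNORDERED_ACCESS_VIEW_DESC desc = {};
        desc.Format = DXGI_FORMAT_R16G16B16A16_FLOAT;
        desc.ViewDimension = D3D11_UAV_DIMENSION_TEXTURE2D;
        desc.Texture2D.MipSlice = 0;

        ID3D11UnorderedAccessView* uav = nullptr;
        device->CreateUnorderedAccessView(tex, &desc, &uav);
        return uav;
    }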
 
I don't think that's quite true, unless you think of physical simulation and signal processing as being "graphics related".
Sure, there are exceptions to the rule. Some workloads are just as embarrassingly parallel as graphics.

The general rule though is that most parallel algorithms require a fine level of control over task and data management. So they don't map well to current GPU architectures. Especially within a game engine the amount of independent parallel work at any point is limited.
For work that I'm doing, Teslas are more than an order of magnitude faster than i7s, and I can get a lot more Teslas into a rack, for less money.
Out of curiosity, what programming language did you use?
 