The End of The GPU Roadmap

(Jan 2000)
2006-7: CPUs become so fast and powerful that 3D hardware will be only marginally beneficial for rendering relative to the limits of the human visual system, therefore 3D chips will likely be deemed a waste of silicon (and more expensive bus plumbing), so the world will transition back to software-driven rendering
http://www.scribd.com/doc/93932/Tim-Sweeney-Archive-Interviews
Nicely done, neliz.

Back when GPUs added ever more fixed-function abilities (e.g. texture coordinate generation and bump mapping), designers quickly realized that it was pointless to spend ever more silicon on individual features. Many features were left unused most of the time. So the solution was to use generic arithmetic operations to perform the calculations in 'software'. Nowadays we're at the point where Direct3D keeps adding ever more 3D pipeline stages, many of them frequently left unused yet still carrying overhead in the hardware and the drivers. The solution is again to add more programmability. It allows the current API pipeline to be implemented more efficiently, but also unleashes many new abilities by giving developers more control.
There's a big difference, though. Back then D3D was adding stages/abilities to each element, whereas now they're adding stages at the high level. The pixel/vertex pipelines became programmable because hardware was already implementing features this way. The original Radeon has pixel and vertex shaders in hardware very similar to DX8 class hardware like the GF3.

The most important thing in hardware design is data flow. Going from DX7 to DX8/DX9 wasn't very hard because the data going in and out of each pixel or vertex computation barely changed. DX9c/DX10 added a wrinkle with texture data flow into the vertices. What you're proposing, though, makes a complete mess of all the nice constraints of data flow that make GPUs so efficient at graphics.

While I agree that it will eventually happen, I disagree about the cause. It won't be for efficiency/simplification or by graphics programmer demand. The only way I see the GPU pipeline going away is through drastic shifts in the market, i.e. stream computing growing to maybe 50% of the GPU market while the graphics side shrinks as integrated graphics becomes increasingly adequate.

I think the tipping point will be the development of the last mass-market, high demand (hence high margin) application for HPC in our society: Automated vehicle driving.

EDIT:
The problem is not so much the software side, but the hardware. It is incapable of efficiently supporting anything that deviates significantly from the Direct3D pipeline. Today's hardware still dedicates massive amounts of silicon to ROPs, texture samplers, rasterization, etc. That's pretty useless for anything other than Direct3D.
This is exactly the point I was making earlier. The pixel and vertex pipelines became programmable because it was basically already structured that way to support the features that consumers were projected to want. GF3 had to add dependent texturing, but ATI barely did anything when going to a DX8 GPU. For DX10, unified pipelines had marginal overall cost because the efficiency gain from load balancing negated the overhead of generality.

All the things you mentioned, though, will never have any economic drive to become generalized or merged. People have been talking about shader based texture filtering for ages but it's not going to happen because you don't come close to winning. If you removed ROPs then you'd have to add some other way of writing data to memory and dealing with generalized read after write hazards, so you're not going to get close to saving anything there. Rasterization and Z-testing have very well defined data flow with data formats/precision very different from those of the general computation units, so just like filtering it makes no sense to generalize and never will unless the hardware idles >99% of the time. The only thing that has a chance of generalizing, IMO, is triangle setup.

Intel is going to use its manufacturing edge to not get completely blown out of the water with Larrabee, but nobody else can generalize without committing suicide.
 
Sure, there are exceptions to the rule. Some workloads are just as embarrassingly parallel as graphics.

Sure, many problems are embarrassingly parallel (graphics among them) and, perhaps unsurprisingly, these are the problems that tend to work well on GPUs. I was just objecting to your assertion that only graphics-related problems worked on GPGPUs. If, by "graphics related", you meant embarrassingly parallel, then I drop my objection. :)

The general rule though is that most parallel algorithms require a fine level of control over task and data management. So they don't map well to current GPU architectures. Especially within a game engine the amount of independent parallel work at any point is limited.

I don't know much about game engines, but I'm not surprised that they're not a good fit to GPGPUs (or Cell). I believe that MS concluded that six to eight threads were all that were justifiable for the portion of a game engine that runs on the CPU. Tim Sweeney just confirmed that by saying that UE3 uses up to six threads, and I've read similar comments from other developers. (Ironically, mostly PS3 developers.)

Out of curiosity, what programming language did you use?

We prototyped in MATLAB and then coded the GPU portions in CUDA, and the CPU portions in C.
 
Nicely done, neliz.
All the things you mentioned, though, will never have any economic drive to become generalized or merged. People have been talking about shader based texture filtering for ages but it's not going to happen because you don't come close to winning. If you removed ROPs then you'd have to add some other way of writing data to memory and dealing with generalized read after write hazards, so you're not going to get close to saving anything there. Rasterization and Z-testing have very well defined data flow with data formats/precision very different from those of the general computation units, so just like filtering it makes no sense to generalize and never will unless the hardware idles >99% of the time. The only thing that has a chance of generalizing, IMO, is triangle setup.

Intel is going to use its manufacturing edge to not get completely blown out of the water with Larrabee, but nobody else can generalize without committing suicide.

I dunno, most of the ex-SGI'ers I know who aren't at ATI or NV (and so have unbiased opinions) seemed to think that software rasterizing was no big deal...

David
 
I dunno, most of the ex-SGI'ers I know who aren't at ATI or NV (and so have unbiased opinions) seemed to think that software rasterizing was no big deal...
Although to be fair there is a bit more to LRB than just software rasterization ;)
 
There's a big difference, though. Back then D3D was adding stages/abilities to each element, whereas now they're adding stages at the high level. The pixel/vertex pipelines became programmable because hardware was already implementing features this way. The original Radeon has pixel and vertex shaders in hardware very similar to DX8 class hardware like the GF3.
Those high level stages still come at a cost. Otherwise a GeForce 6800 would have been capable of geometry shading, so to speak. NVIDIA and ATI have to change their architecture to support Direct3D 11 and will have to do so again for Direct3D 12, etc. At some point all these added costs don't make sense any more. By making a fully programmable and flexible architecture they can lower the overall cost and support a wider range of workloads. In fact this evolution has been going on for a while, but very slowly. Shader Model 2.0 made all registers floating-point. Shader Model 3.0 allows arbitrary register semantics and unified vertex and pixel processing. Shader Model 4.0 demands storing arbitrary amounts of data. All those things came at a significant cost, but GPUs have still become faster with every generation. So it's not all that revolutionary to take the step toward something as programmable as a CPU, without necessarily losing performance.
What you're proposing, though, makes a complete mess of all the nice constraints of data flow that make GPUs so efficient at graphics.
Are they? As soon as you do something a little less conventional, performance starts to drop. Even for graphics. The definition of what we call graphics has just become more diverse.

Think of pixel shaders. Back in the fixed-function days, very few people had any idea about the importance of programmable shading. Who needs anything more than a lightmap and maybe a bumpmap, right? Nowadays, the ability to run complex shaders with hundreds of instructions seems the most obvious thing.

Software rendering will unleash an even bigger revolution. We don't fully know yet what developers will come up with, only that in a few years time we won't be able to live without it any more.
All the things you mentioned, though, will never have any economic drive to become generalized or merged. People have been talking about shader based texture filtering for ages but it's not going to happen because you don't come close to winning. If you removed ROPs then you'd have to add some other way of writing data to memory and dealing with generalized read after write hazards, so you're not going to get close to saving anything there. Rasterization and Z-testing have very well defined data flow with data formats/precision very different from those of the general computation units, so just like filtering it makes no sense to generalize and never will unless the hardware idles >99% of the time. The only thing that has a chance of generalizing, IMO, is triangle setup.
Never say never. It took only about 7 years to go from highly dedicated fixed-function T&L to fully programmable unified vertex processing...

Programmable filtering quickly starts to make sense because a lot of non-graphics applications don't require filtering at all, while graphics applications are using bilinear filter kernels to achieve higher level filtering. This can be optimized by adding more generic gather units, and using the gazillion FLOPS to perform the actual filtering. Try to think 7 years ahead from now.
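(To illustrate what that would mean in practice, here's a minimal sketch, in plain C++ rather than any real shader ISA, of bilinear filtering done on generic hardware: the only "special" operation is the 2x2 gather, and the weighting itself is a handful of ALU instructions. The single-channel float texture is a simplifying assumption.)

```cpp
#include <cmath>

// Minimal sketch (not any real hardware's implementation): bilinear filtering
// expressed as a 2x2 gather followed by plain ALU work. A hypothetical
// single-channel float texture keeps the example short.
struct Texture {
    const float* texels;  // row-major, width * height
    int width, height;
};

static float fetch(const Texture& t, int x, int y) {
    // Clamp-to-edge addressing; one of several possible wrap modes.
    x = x < 0 ? 0 : (x >= t.width  ? t.width  - 1 : x);
    y = y < 0 ? 0 : (y >= t.height ? t.height - 1 : y);
    return t.texels[y * t.width + x];
}

float sampleBilinear(const Texture& t, float u, float v) {
    // Map normalized coordinates to texel space.
    float x = u * t.width  - 0.5f;
    float y = v * t.height - 0.5f;
    int x0 = (int)std::floor(x);
    int y0 = (int)std::floor(y);
    float fx = x - x0, fy = y - y0;

    // The gather: four neighbouring texels.
    float t00 = fetch(t, x0,     y0);
    float t10 = fetch(t, x0 + 1, y0);
    float t01 = fetch(t, x0,     y0 + 1);
    float t11 = fetch(t, x0 + 1, y0 + 1);

    // The filter: three lerps, pure ALU work.
    float top    = t00 + (t10 - t00) * fx;
    float bottom = t01 + (t11 - t01) * fx;
    return top + (bottom - top) * fy;
}
```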

ROPs are also still highly dedicated to graphics. And they're idle a lot of the time (either entirely or subfeatures of them). So you can get a better use of your silicon by replacing them with generic memory write units and performing all arithmetic in the shader cores, plus again you'd improve support for non-graphics applications.

And I really don't get all the panic about rasterization. It's a minor task compared to the work going on in the shaders.
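(For what it's worth, the core of a software rasterizer really is small. A bare-bones half-space sketch in plain C++, with no clipping, fill rule or sub-pixel precision, just to show the shape of the inner loop:)

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative half-space triangle rasterizer: walk the bounding box and test
// each pixel against the three edge functions. Deliberately simplified.
struct Vec2 { float x, y; };

static float edge(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

void rasterize(const Vec2& v0, const Vec2& v1, const Vec2& v2,
               int width, int height, std::vector<uint8_t>& coverage) {
    int minX = std::max(0,          (int)std::min({v0.x, v1.x, v2.x}));
    int maxX = std::min(width - 1,  (int)std::max({v0.x, v1.x, v2.x}));
    int minY = std::max(0,          (int)std::min({v0.y, v1.y, v2.y}));
    int maxY = std::min(height - 1, (int)std::max({v0.y, v1.y, v2.y}));

    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            Vec2 p{x + 0.5f, y + 0.5f};
            float w0 = edge(v1, v2, p);
            float w1 = edge(v2, v0, p);
            float w2 = edge(v0, v1, p);
            // Inside if all edge functions agree in sign (CCW winding assumed).
            if (w0 >= 0.0f && w1 >= 0.0f && w2 >= 0.0f)
                coverage[y * width + x] = 255;
        }
    }
}
```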
 
I don't know much about game engines, but I'm not surprised that they're not a good fit to GPGPUs (or Cell). I believe that MS concluded that six to eight threads were all that were justifiable for the portion of a game engine that runs on the CPU. Tim Sweeney just confirmed that by saying that UE3 uses up to six threads, and I've read similar comments from other developers. (Ironically, mostly PS3 developers.)
Six to eight threads is all that is justifiable merely because CPUs don't support any more threads yet.

You can trade some scalability for performance. If your target CPU only has two cores, it makes no sense to spend a lot of time trying to split your workloads into many fine-grained tasks. You basically split your engine into independent systems that need to run every frame: rendering, physics, game logic, A.I., sound, etc. This will get you a nice speedup on a dual-core, but won't offer nearly as big a speedup on a quad-core. So when quad-cores became a reality, game developers had to look for better ways to put the extra power to good use. Valve came up with hybrid threading, back in 2006.
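(Roughly, that coarse split looks like the sketch below, written with C++11 threads and stubbed-out system functions. One thread per engine system maps nicely onto two to four cores, which is exactly why it stops scaling beyond that.)

```cpp
#include <thread>

// Stubbed per-frame system updates; placeholders for real engine work.
void updateRendering() { /* draw the frame */ }
void updatePhysics()   { /* step the simulation */ }
void updateGameLogic() { /* run gameplay code */ }
void updateAI()        { /* update agents */ }
void updateSound()     { /* mix audio */ }

// Coarse-grained split: one thread per independent system. Great on a
// dual- or quad-core, but an 8- or 16-core CPU gains almost nothing extra.
void runFrame() {
    std::thread rendering(updateRendering);
    std::thread physics(updatePhysics);
    std::thread logic(updateGameLogic);
    std::thread ai(updateAI);
    std::thread sound(updateSound);

    rendering.join();
    physics.join();
    logic.join();
    ai.join();
    sound.join();
}

int main() {
    runFrame();  // one frame of the hypothetical game loop
    return 0;
}
```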

But that's not the end of it. When CPUs start supporting many more threads, it does make sense to look at fine-grained task scheduling. Research agrees that there is still a lot of task parallelism to exploit. It just comes at an overhead (run-time overhead, but also development effort) that is not yet fully justified. We're still in a transition period from single-core to multi-core. We haven't yet completely let go of the single-threaded way of thinking we were taught in school.

So it will take time for developers to get comfortable with developing for multi-core CPUs. But new libraries, frameworks, programming languages and other tools are appearing every day to assist them. For instance, C++0x will add several features to better support concurrency. Furthermore, neither Intel nor AMD can afford not to stay on track in the evolution toward more cores. All roadmaps include six- and eight-core CPUs for the enthusiast and mainstream markets within a few years' time.
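(For example, the concurrency support that eventually shipped as C++11 includes std::thread, std::async and std::future. A minimal sketch of handing a task to the runtime and collecting its result:)

```cpp
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Sum a vector; stands in for any self-contained chunk of work.
long long sum(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0LL);
}

int main() {
    std::vector<int> data(1 << 20, 1);

    // Hand the task to the library; it runs on another thread.
    std::future<long long> partial =
        std::async(std::launch::async, sum, std::cref(data));

    // The calling thread is free to do other work here...

    long long total = partial.get();  // ...then blocks for the result.
    return total == (1 << 20) ? 0 : 1;
}
```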

So in conclusion the number of threads developers use depends on the number of cores available, not the other way around.
We prototyped in MATLAB and then coded the GPU portions in CUDA, and the CPU portions in C.
You mean you used MATLAB to conclude that a Tesla is "more than an order of magnitude faster" than a Core i7?

Oh dear.
 
Those high level stages still come at a cost. Otherwise a GeForce 6800 would have been capable of geometry shading, so to speak. NVIDIA and ATI have to change their architecture to support Direct3D 11 and will have to do so again for Direct3D 12, etc. At some point all these added costs don't make sense any more.
Changing architectures is great for corporations. It's a great way to make people buy more stuff.
Up until the point that the market won't bear API transitions, the cost of making such transitions is incremental and an expected cost of doing business.

By making a fully programmable and flexible architecture they can lower the overall cost and support a wider range of workloads. In fact this evolution has been going on for a while, but very slowly. Shader Model 2.0 made all registers floating-point.
That's a step x86 CPUs haven't taken. Why don't we get on them for not making their architectures more flexible?

Programmable filtering quickly starts to make sense because a lot of non-graphics applications don't require filtering at all, while graphics applications are using bilinear filter kernels to achieve higher level filtering. This can be optimized by adding more generic gather units, and using the gazillion FLOPS to perform the actual filtering. Try to think 7 years ahead from now.
It's rather interesting that you'd call adding support for large-scale arbitrary gather and filtering with a gazillion FLOPS at full machine precision optimized.
Maybe it is from a software standpoint.

ROPs are also still highly dedicated to graphics. And they're idle a lot of the time (either entirely or subfeatures of them). So you can get a better use of your silicon by replacing them with generic memory write units and performing all arithmetic in the shader cores, plus again you'd improve support for non-graphics applications.

http://forum.beyond3d.com/showpost.php?p=1321966&postcount=1815
ROPs on GT200 take up 7.8% of the die area.
The most we could expect from removing them would be an extra 7.8% die area for ALUs.
That would allow for ~20% more ALUs, if no other hardware needed to scale.
If we also include the TEX units needed to feed them data, we can expect perhaps ~16% more FLOPs.
Such a GPU would be pointless, as those ROPs were the ones that generated the bulk of the outgoing memory traffic for the graphics chip.
Adding the same amount of writeback capability for the ALUs is going to inflate their size.
I don't think it is unreasonable to expect maybe 10% extra FLOPs peak. I do not think workloads have hit the point that emulating the ROPs won't result in the consumption of far more.
Earlier estimates on this forum for AMD designs with 16 ROPs required something like an additional 1000 lanes for equivalent throughput.
(edit: scratch the 1000, that's for full ALU work for filtering, a separate hardware discussion)
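Written out, the arithmetic goes roughly like this (the ALU and TEX area fractions below are assumptions picked to reproduce the ~20% and ~16% figures, not measured numbers):

```cpp
#include <cstdio>

int main() {
    // ROP fraction from the GT200 die-area estimate linked above; the ALU and
    // TEX fractions are assumed values, not measurements.
    const double rop_area = 0.078;  // ROPs: 7.8% of die area
    const double alu_area = 0.39;   // assumed shader-core fraction
    const double tex_area = 0.10;   // assumed texture-unit fraction

    // Reallocate the ROP area to ALUs alone...
    std::printf("extra ALUs (ALUs only): ~%.0f%%\n", 100.0 * rop_area / alu_area);

    // ...or to ALUs plus the TEX units needed to feed them.
    std::printf("extra FLOPs (ALU+TEX): ~%.0f%%\n",
                100.0 * rop_area / (alu_area + tex_area));
    return 0;
}
```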


Six to eight threads is all that is justifiable merely because CPUs don't support any more threads yet.
This is a chicken-and-egg problem: they don't support more threads because the software generally doesn't have them.

Furthermore, neither Intel nor AMD can afford not to stay on track in the evolution toward more cores. All roadmaps include six- and eight-core CPUs for the enthusiast and mainstream markets within a few years' time.
Which roadmaps would those be?
I may not be up on AMD's latest, as they change so often, but at least through 2011 only the enthusiast SKUs go higher than 4 (probably because they are in line with the server options).
The mainstream desktop and mobile segments top out at a max of 4, and an additional *PU which shall not be named...

Intel's Westmere is another interesting thing.
 
Programmable filtering quickly starts to make sense because a lot of non-graphics applications don't require filtering at all, while graphics applications are using bilinear filter kernels to achieve higher level filtering. This can be optimized by adding more generic gather units, and using the gazillion FLOPS to perform the actual filtering. Try to think 7 years ahead from now.

Hell, even LRB does its texture filtering in hw. :rolleyes:
 
Even in a fully programmable world we might still have texture units: units with the same ISA as the other units but with different caches and threading (transparent to the developer), which would simply be faster at texture sampling and filtering (and perhaps slower at other tasks).
 
But in that case it would no longer be considered fixed function, right? Just another processor in a superscalar architecture.
 
So basically we're gonna go full circle.

- Started with purely software rendering on the CPU.
- Wasn't fast enough of course, back with 200 MHz CPUs, so we went to ASIC 3D hardware that could do the math for some more fancy effects.
- GPUs slowly add programmability for fancier effects that we realize we can do with this dedicated hardware.
- Discovered that we need more and more programmability to do more complex rendering
- GPUs becoming capable of running even more types of code, but it's primarily graphics related
- CPU vendors want in too so they slowly add instruction set features to do 3D better.
- CPU vendors start developing GPU-like CPUs, GPUs doing GPGPU
- The Great Era of Merging occurs

You know what probably won't go away? The wide range of performance available in PCs for gaming. Today you have IGPs at the low end and quad GPU setups at the top. Tomorrow I wonder if we'll have low end quad cores or some such at the bottom and 32 core (or whatever) monsters at the top.
 
So basically we're gonna go full circle.

- Started with purely software rendering on the CPU.
- Wasn't fast enough of course, back with 200 MHz CPUs, so we went to ASIC 3D hardware that could do the math for some more fancy effects.
- GPUs slowly add programmability for fancier effects that we realize we can do with this dedicated hardware.
- Discovered that we need more and more programmability to do more complex rendering
- GPUs becoming capable of running even more types of code, but it's primarily graphics related
- CPU vendors want in too so they slowly add instruction set features to do 3D better.
- CPU vendors start developing GPU-like CPUs, GPUs doing GPGPU
- The Great Era of Merging occurs

You know what probably won't go away? The wide range of performance available in PCs for gaming. Today you have IGPs at the low end and quad GPU setups at the top. Tomorrow I wonder if we'll have low end quad cores or some such at the bottom and 32 core (or whatever) monsters at the top.

Nobody uses SLI, and it's going to stay that way. The big problem with the merging is that on-die graphics will substantially reduce the market for discrete, since the performance will keep on going up.

Nothing at the high-end will ever be integrated, but there's an awful lot of discrete cards that could easily be integrated.

David
 
- CPU vendors want in too so they slowly add instruction set features to do 3D better.
They do want in, but this isn't how they are getting in ... or at least this is a very misleading way of describing what Larrabee is doing.
 
They do want in, but this isn't how they are getting in ... or at least this is a very misleading way of describing what Larrabee is doing.

Let me revise. ;) [consumer-level transistor budget of the time]

- Pure software rendering on the PC CPU (ignoring 2D GUI accelerators) [~ 1 million transistors]
- Not fast enough, limiting visual improvement options, so we make ASIC 3D hardware that is designed solely for the math involved with 3D [millions of transistors]
- CPU vendors want in resulting in continual SIMD improvements [~ten million transistors]
- GPUs get programmability to satisfy desire for more flexibility in utilizing the fast dedicated hardware. Almost exclusively limited to graphics processing [tens of millions of transistors]
- GPUs become flexible enough for more general code, no formalized GPGPU APIs yet [hundreds of millions of transistors]
- CPUs go multicore with more and more advanced SIMD features [hundreds of millions of transistors]
- CPU vendors working on GPU-like CPUs (ie Intel Larrabee) [~half billion transistors]
- GPUs continue to add programmability and transition to less specialized hardware. GPGPU APIs [~half billion transistors]
- Evolutions of CPUGPU and GPUCPU [billion transistors]
- ..
- The Great Era of Merging
 
The only thing that comes to mind reading the title is Randy's "We didn't listen!!" from an old South Park episode.

Larrabee, where are you?
 
I still don't think that fully software-based rendering would bring that many changes. Offline CG has always been software based, and yet most renderers use the same approaches with very few (and usually patented; I'm thinking of Pixar...) changes. This is because we haven't really found more effective, versatile and practical approaches yet.

Leaving the standard rasterized-triangles path would be good for fluids, smoke, fire and such stuff, but anything else would be problematic. Like, if Epic were to change UE5 to voxel rendering... how easy would it be to turn half the game studios around to completely change their art pipeline and re-educate people? Would the results outweigh the costs in time, money and effort? Hard to imagine...
 
Changing architectures is great for corporations. It's a great way to make people buy more stuff. Up until the point that the market won't bear API transitions, the cost of making such transitions is incremental and an expected cost of doing business.
It only takes one competitor to offer forward compatibility, and the others will have to follow or lose customers. Soon we're at the point where any limitation is just artificial, and consumers will no longer buy that crap.

But that doesn't mean that less hardware will be sold. They'll just have to compete entirely on performance. Enthusiasts will continue to buy new hardware every time it gets faster, while the mainstream market continues to buy new stuff at its own pace.
That's a step x86 CPUs haven't taken. Why don't we get on them for not making their architectures more flexible?
What? SSE registers can store single-precision floating-point, double-precision floating-point, 128-bit, 64-bit, 32-bit, 16-bit and 8-bit integer components.
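(A quick intrinsics sketch of that flexibility: the same 128-bit register file holds four floats one moment and four 32-bit integers the next, with nothing more than a reinterpretation between the two views.)

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    // Four single-precision floats in one 128-bit SSE register...
    __m128 f = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    f = _mm_add_ps(f, _mm_set1_ps(0.5f));

    // ...reinterpreted, cost-free, as four 32-bit integer lanes...
    __m128i bits = _mm_castps_si128(f);

    // ...and the same register file doing plain integer arithmetic.
    __m128i i = _mm_set_epi32(40, 30, 20, 10);
    i = _mm_add_epi32(i, _mm_set1_epi32(5));

    float    fout[4];
    int      iout[4];
    unsigned raw[4];
    _mm_storeu_ps(fout, f);
    _mm_storeu_si128((__m128i*)iout, i);
    _mm_storeu_si128((__m128i*)raw, bits);

    std::printf("floats: %g %g %g %g\n", fout[0], fout[1], fout[2], fout[3]);
    std::printf("ints:   %d %d %d %d\n", iout[0], iout[1], iout[2], iout[3]);
    std::printf("float bit patterns: %08x %08x %08x %08x\n",
                raw[0], raw[1], raw[2], raw[3]);
    return 0;
}
```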
It's rather interesting that you'd call adding support for large-scale arbitrary gather and filtering with a gazillion FLOPS at full machine precision optimized.
Maybe it is from a software standpoint.
ALU:TEX ratios continue to rise. At the same time, we're starting to see a clear need to access memory in a more generic way, not just by reading textures. The post you're linking to suggests that GT200 dedicates only 13% of die space to texture samplers. So it's not inconceivable that in the future they'll be replaced with generic gather units, with the actual filtering done in the shader cores.

Today's GPUs perform all calculations in 32-bit floating-point. Yet most of the time with graphics we're still working with colors that have 8-bit integer components. Is that optimized? So why would it be a big deal to generalize the texture samplers as well? They pretty much have to be capable of filtering floating-point textures at full precision anyway.
I do not think workloads have hit the point that emulating the ROPs won't result in the consumption of far more.
Again, large portions of the ROPs are idle at any given time. Linear to sRGB conversion and back, alpha blend, stenciling, depth test, anti-aliasing (color compression), write masking, etc. All these things can be turned off or on, but are mostly off. So while in the worst case emulation would be slower, the typical case can be just as efficient.
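(For a sense of scale: emulating the common source-over blend is one read, a few multiply-adds and a write per pixel. A sketch assuming an 8-bit RGBA framebuffer and ignoring sRGB conversion, dithering and write masks:)

```cpp
#include <cstdint>

// Sketch of emulating the ROP's source-over alpha blend on one pixel of an
// 8-bit RGBA framebuffer. Deliberately ignores sRGB, dithering and masking.
struct RGBA8 { uint8_t r, g, b, a; };

static uint8_t blend8(uint8_t src, uint8_t dst, uint8_t alpha) {
    // (src * a + dst * (255 - a)) / 255, with rounding.
    return (uint8_t)((src * alpha + dst * (255 - alpha) + 127) / 255);
}

void blendPixel(RGBA8* framebuffer, int index, RGBA8 src) {
    RGBA8 dst = framebuffer[index];  // the read-modify-write a ROP normally does
    dst.r = blend8(src.r, dst.r, src.a);
    dst.g = blend8(src.g, dst.g, src.a);
    dst.b = blend8(src.b, dst.b, src.a);
    dst.a = blend8(src.a, dst.a, src.a);  // one of several possible alpha rules
    framebuffer[index] = dst;
}
```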

But where it really starts getting interesting is when the ROPs are a bottleneck. I've had software rendering beat hardware rendering hands down because of limited ROP throughput. Sure, you could add more ROPs, but then you're clearly eating into other components. With an ever growing variety of workloads, we're going to see this happen more often in the future unless they start making them more generic and unified.

And that's the point of the whole discussion. You're probably right that workloads haven't hit the point where emulation is more efficient. Yet! Less than ten years from now, that situation will have changed drastically. Games will consist of maybe 50% graphics calculations that behave nicely, and 50% exotic algorithms that require highly generic computing cores. Optimized ROPs won't be worth a thing if the exotic workloads run horribly because you didn't invest in more generic components that support arbitrary data flows. An architecture that has to 'emulate' ROP functionality but is otherwise fully capable of running arbitrary workloads will be much faster overall.
This is a chicken-and-egg problem: they don't support more threads because the software generally doesn't have them.
I think the word "generally" is spot on. The problem isn't new software, it's legacy software. If we'd had CPUs with 32 cores by now, we'd have developers writing software with 32 threads. But they would have to sacrifice single-threaded performance to create such a CPU today. That's just not going to happen while there's still a majority of single-threaded software. But it doesn't mean the evolution toward more cores has stopped either. In ten years' time the situation will be totally different. All software will be multi-threaded and scalable. So a 32-core CPU with somewhat lower per-thread performance will, on many occasions, win out over a 16-core CPU.
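(In code terms the conclusion is simply: size the worker pool from the hardware instead of hard-coding six or eight threads. A C++11 sketch with a trivial stub worker:)

```cpp
#include <cstdio>
#include <thread>
#include <vector>

// Trivial stub; a real engine worker would pull fine-grained tasks off a
// shared queue until the frame is done.
void worker(unsigned id) { (void)id; }

int main() {
    // Ask the hardware how many threads it supports rather than assuming 6 or 8.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;  // the query is allowed to return 0; fall back to a guess

    std::printf("spawning %u workers\n", n);
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back(worker, i);
    for (auto& t : pool)
        t.join();
    return 0;
}
```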

Like I said, we're still in the middle of a transition. It's going to take several more years but we'll have both chickens and eggs and it won't matter any more which one came first. ;)
 