Software/CPU-based 3D Rendering

Computer graphics rendering is no magic that needs endless iterations of new approaches. The list of all rendering problems can be put on a single page and the list of unique concepts to solve them can be condensed to a couple of lines.
Hundreds of new graphics rendering papers are released every year. New groundbreaking techniques that radically change our graphics rendering pipelines get introduced almost every year. We are nowhere close to knowing how to craft a "perfect" rendering pipeline. I have been working in the field for 10 years, and the progress hasn't slowed down; every year we get more new research. The arrival of DX11 compute shaders and a new console generation that supports efficient general purpose GPU compute will result in more new research than ever occurring in the game graphics field. Mark my words: there will be huge new innovations for graphics rendering coming in the following 3-5 years. And programmable pipelines are the key that allows this innovation to happen.

Seemingly simple things such as real time shadow map rendering are still mostly based around hacks and shortcuts. There is no perfect approach that suits every occasion and is widely accepted as the best solution. All the main areas of shadow mapping are still heavily debated: shadow filtering / antialiasing (PCF, ESM, VSM, EVSM, CSM, screen space bilateral, etc), various projections / warping / logarithmic rendering, texel distribution / cascade splitting / tile splitting, acne prevention (various biasing techniques, back faces / middle, etc), resolution improvement techniques (storing edge info to pixels), etc, etc, etc. In addition to quality improvements, there's a huge amount of open research in improving shadow map rendering performance. I should point out that this list of open issues is incomplete, and discusses only simple hard edged shadows. Soft shadowing (from area light sources) is an area with a huge amount of new research going on as well (such as sparse voxel octree cone tracing based techniques). And translucent shadow research has started to become a hot topic lately as well.
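
To make the filtering part of that list concrete, here's a minimal scalar sketch of plain 3x3 percentage-closer filtering (PCF) over a software shadow map. The ShadowMap struct, the constant bias value and the kernel size are illustrative assumptions of mine, not anyone's production code; real implementations use rotated/Poisson kernels, slope-scaled bias, hardware comparison samplers, and so on.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal software shadow map: one light-space depth value (0..1) per texel.
struct ShadowMap {
    std::size_t width = 0, height = 0;
    std::vector<float> depth;

    float fetch(int x, int y) const {
        x = std::clamp(x, 0, static_cast<int>(width) - 1);
        y = std::clamp(y, 0, static_cast<int>(height) - 1);
        return depth[static_cast<std::size_t>(y) * width + x];
    }
};

// 3x3 percentage-closer filtering: compare the receiver depth against each
// neighbouring shadow map texel and average the binary results.
// 'bias' is the classic constant depth bias used to fight shadow acne.
float pcf3x3(const ShadowMap& sm, float u, float v, float receiverDepth,
             float bias = 0.002f)
{
    const int cx = static_cast<int>(u * static_cast<float>(sm.width));
    const int cy = static_cast<int>(v * static_cast<float>(sm.height));

    float lit = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            lit += (receiverDepth - bias <= sm.fetch(cx + dx, cy + dy)) ? 1.0f : 0.0f;

    return lit / 9.0f;   // 0 = fully shadowed, 1 = fully lit
}
```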

Link to SIGGRAPH 2013 shadow mapping course slides:
http://www.realtimeshadows.com/sites/default/files/sig2013-course-hardshadows_0.pdf

And this is just one field of real time graphics rendering research (shadow mapping). Since deferred rendering became popular (~2009), there's been an endless stream of papers about different lighting/material pipelines that are decoupled from the fixed function rasterization.

I don't think there's any real possibility for fixed function rendering hardware as long as we cannot even decide what is the most efficient way to store our geometry in the future (voxels, mathematical curves / subdivisions, triangles) and how we want to get this data set to the screen (rasterize/project "scatter", ray/path/cone trace "gather"). And this is just the opaque geometry. Dust/fog/water/particles and other volumetric substances require their own techniques as well (and their own fixed function hardware units), if we don't have programmable hardware to render them.
 
I am well aware of past and current rendering technique evolution, but this doesn't invalidate my argument. Most game devs didn't know anything about per pixel lighting and concurrent processing when the current generation of consoles arrived. Yet no groundbreaking advances have been accomplished in the last five years for those systems.

There are no major things to conquer anymore after it has been determined what concept is most suitable for endlessly scalable geometry, indirect lighting and non-solid volumes.

Anyway, I won't go on to defend my claim here that a fixed function hardware render pipeline is the future. I'd rather just do my work.
 
Yet no groundbreaking advances have been accomplished in the last five years for those systems.
You don't feel global illumination running on PS360 qualifies? Anyway, your statement feels a bit like a strawman, because regardless of what anyone else says you could just say it's not groundbreaking. :)

Anyway, I won't go on to defend my claim here that a fixed function hardware render pipeline is the future.
I think the reason you've hit resistance on this point is because you haven't really explained why fixed function would be the future, when the trend has clearly been going 180 degrees in the other direction for pretty much a decade and a half now; towards MORE programmability and more general...ity. Not less.
 
Mark my words: there will be huge new innovations for graphics rendering coming in the following 3-5 years. And programmable pipelines are the key that allows this innovation to happen.
What would be your bet on the matter? What technical innovation (on the silicon side) do you expect to allow the breakthrough(s) you are thinking about?
 
...and the biggest one, for the foreseeable future, is likely to be power.
Power is going to be dominated by data movement. NVIDIA estimates that on the 10 nm node, accessing a fixed-function unit which consumes virtually no power itself but is 10 mm away from the cores requires the same amount of energy as 23 double-precision FMA operations. It used to be the equivalent of only 6 of those.

With the right instruction set, you can really do a lot with 23 operations. Also note that the alternative 'solution' of bringing the fixed-function units closer to the cores would really just be an intermediate step toward bringing them into the cores as part of the instruction execution units.

So power consumption really isn't going to stop the convergence between CPU and GPU architectures.
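
A back-of-the-envelope sketch of how such an "N FMAs worth of energy" figure falls out of the wire math. All the constants below are illustrative placeholders I picked for the example, not NVIDIA's published numbers:

```cpp
#include <cstdio>

int main() {
    // Illustrative placeholder values only -- the real figures depend on the
    // process node and come from NVIDIA's exascale papers.
    const double fma_energy_pj      = 10.0;    // energy of one DP FMA, in picojoules
    const double wire_pj_per_bit_mm = 0.15;    // on-chip wire energy per bit per mm
    const double distance_mm        = 10.0;    // distance to the fixed-function unit
    const double bits_moved         = 2 * 64;  // operand out + result back, 64 bits each

    const double move_energy_pj = bits_moved * wire_pj_per_bit_mm * distance_mm;
    std::printf("moving the data: %.1f pJ  ~=  %.1f DP FMAs\n",
                move_energy_pj, move_energy_pj / fma_energy_pj);
}
```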
 
Most game devs didn't know anything about per pixel lighting and concurrent processing when the current generation of consoles arrived. Yet no groundbreaking advances have been accomplished in the last five years for those systems.
No groundbreaking advances? Per pixel lighting was pure forward rendering during the early years of current gen consoles (2005-2008). The first deferred rendering techniques started to become popular around 2009 and completely changed how we perform per pixel lighting. Since then there have been several new groundbreaking techniques/variations of it invented every year. I intended to write a long post about various deferred rendering techniques, but I found this blog post that contained exactly what I wanted to say (lots of debate and 10+ links to research papers and further discussion regarding that area): http://c0de517e.blogspot.fi/2011/01/mythbuster-deferred-rendering.html.
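
For readers who haven't followed the shift, here's a minimal sketch of what "deferred" means in practice. The G-buffer layout and the simple Lambert point-light model below are my own simplified assumptions; the point is only that lighting becomes a screen-space loop completely decoupled from geometry submission and rasterization:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3  sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3  normalize(Vec3 v) {
    float len = std::sqrt(dot(v, v));
    return {v.x / len, v.y / len, v.z / len};
}

// One G-buffer texel: everything the lighting pass needs, written once by the
// geometry pass, so no scene geometry is touched while shading.
struct GBufferTexel {
    Vec3 position;   // world-space position
    Vec3 normal;     // world-space normal
    Vec3 albedo;     // diffuse colour
};

struct PointLight { Vec3 position; Vec3 color; float radius; };

// Deferred lighting pass: for every screen pixel, accumulate the contribution
// of every light that reaches it (simple Lambert diffuse with linear falloff).
void lightingPass(const std::vector<GBufferTexel>& gbuffer,
                  const std::vector<PointLight>& lights,
                  std::vector<Vec3>& outColor)
{
    for (std::size_t i = 0; i < gbuffer.size(); ++i) {
        const GBufferTexel& g = gbuffer[i];
        Vec3 result{0.0f, 0.0f, 0.0f};
        for (const PointLight& light : lights) {
            Vec3  toLight = sub(light.position, g.position);
            float dist    = std::sqrt(dot(toLight, toLight));
            if (dist > light.radius) continue;                       // trivial light culling
            float ndotl = std::max(0.0f, dot(g.normal, normalize(toLight)));
            float atten = 1.0f - dist / light.radius;
            result.x += g.albedo.x * light.color.x * ndotl * atten;
            result.y += g.albedo.y * light.color.y * ndotl * atten;
            result.z += g.albedo.z * light.color.z * ndotl * atten;
        }
        outColor[i] = result;
    }
}
```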

Physically based shading is another big topic that has started to become really popular in real time rendering research. Lots of new papers about it were released at Siggraph 2013, and there was a course as well (a very good read for everyone interested in the current state of real time lighting): http://blog.selfshadow.com/publications/s2013-shading-course.
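
As a taste of what "physically based" boils down to in code, here's a sketch of the standard Cook-Torrance specular term with a GGX distribution, Schlick Fresnel and a Smith-style geometry term (the textbook formulas discussed in those courses, written as plain C++; not any particular engine's implementation):

```cpp
#include <algorithm>
#include <cmath>

// GGX / Trowbridge-Reitz normal distribution function.
float distributionGGX(float NdotH, float roughness) {
    float a  = roughness * roughness;
    float a2 = a * a;
    float d  = NdotH * NdotH * (a2 - 1.0f) + 1.0f;
    return a2 / (3.14159265f * d * d);
}

// Schlick's approximation of the Fresnel term (F0 = reflectance at normal incidence).
float fresnelSchlick(float VdotH, float F0) {
    return F0 + (1.0f - F0) * std::pow(1.0f - VdotH, 5.0f);
}

// Smith-style geometry term in the Schlick-GGX form popularised by the
// SIGGRAPH physically based shading courses.
float geometrySmith(float NdotV, float NdotL, float roughness) {
    float k  = (roughness + 1.0f) * (roughness + 1.0f) / 8.0f;
    float gv = NdotV / (NdotV * (1.0f - k) + k);
    float gl = NdotL / (NdotL * (1.0f - k) + k);
    return gv * gl;
}

// Cook-Torrance specular BRDF: f = D * F * G / (4 * NdotL * NdotV).
float cookTorranceSpecular(float NdotL, float NdotV, float NdotH, float VdotH,
                           float roughness, float F0) {
    float D = distributionGGX(NdotH, roughness);
    float F = fresnelSchlick(VdotH, F0);
    float G = geometrySmith(NdotV, NdotL, roughness);
    return (D * F * G) / std::max(4.0f * NdotL * NdotV, 1e-4f);
}
```
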
There are no major things to conquer anymore after it has been determined what concept is most suitable for endlessly scalable geometry, indirect lighting and non-solid volumes.
"Determined" is a strange choice of words. I would pick the word "reseached" instead. None of the open issues in real time rendering are trivial to solve, and there's hundreds of open issues. Some of them will be solved during our lifetime, but some will not.
I won't go on to defend my claim here that a fixed function hardware render pipeline is the future.
I hope that pure fixed function rendering is not happening any time soon (in our lifetime). Programmable hardware allows much faster, broader and more complete research. If only hardware manufacturers are allowed to do the research (fixed function hardware does everything), we lose 99%+ of our researchers (as programmers couldn't develop new graphics features anymore). That would delay the "perfect" hardware even further into the future.
What technical innovation (on the silicon side) do you expect to allow the breakthrough(s) you are thinking about?
I am glad you asked... my answer is simple: nothing on the hardware side. New breakthroughs are going to be pure software.

We've already had GPUs that run compute shaders (thread synchronization, atomics, shared work memory) for the last three generations of PC hardware. We will finally have consoles that also support flexible GPU compute. As soon as game rendering pipelines no longer need to be designed around DX9+ consoles and DX10 PCs, we can free ourselves from the shackles. Hopefully next gen consoles and games will sell exceptionally well, and we will get there as soon as possible :)
 
Power is going to be dominated by data movement. NVIDIA estimates that on the 10 nm node, accessing a fixed-function unit which consumes virtually no power itself but is 10 mm away from the cores requires the same amount of energy as 23 double-precision FMA operations. It used to be the equivalent of only 6 of those.

With the right instruction set, you can really do a lot with 23 operations. Also note that the alternative 'solution' of bringing the fixed-function units closer to the cores would really just be an intermediate step toward bringing them into the cores as part of the instruction execution units.

So power consumption really isn't going to stop the convergence between CPU and GPU architectures.
I have read all of Nvidia's papers on exascale computing, and I must say that I agree with you (and Nvidia).

The more fixed function units you have, the more potential bottlenecks you have in your hardware, making it harder and harder to fully utilize all the transistors all the time. To combat the bottlenecks (of various rendering pipeline stages), every single kind of fixed function hardware needs to be over-provisioned, and this further increases the amount of dark silicon (powered down transistors that do nothing but cost chip area).

Powered down transistors do not consume that much power themselves, but as data movement is expensive, all the most important (and over-provisioned) fixed function units need to be close to each other. The more different kinds of units we have, the more area they take, the further away we must place them, and thus the more power we lose in data movement. General purpose vector processors are quite area efficient and, as they are general purpose, we do not need to over-provision them (just have enough computational capacity to max out the bandwidth links).

Update: Just wanted to make clear that I still believe that fixed function units that are widely used in various algorithms are a good use of transistors. Examples: floating point math (including transcendentals), hardware cache logic (cache policies, coherence and prefetching) and texture filtering & fetching & address calculation (esp. for anisotropic filtering). Rasterization on the other hand could be implemented mostly in software (and that would allow many new algorithms).
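
To illustrate why software rasterization is a realistic option: the heart of a rasterizer is just a few edge-function evaluations per pixel, which map naturally onto the same vector ALUs that already run shaders. A minimal scalar sketch of mine (no clipping, fill rules, perspective correction or tiling):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };

// Signed area of the parallelogram spanned by (b-a) and (c-a).
// Positive when c is to the left of edge a->b (counter-clockwise winding).
static float edgeFunction(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Fill one triangle into a framebuffer by testing every pixel of its bounding
// box against the three edge functions. A real implementation would add fill
// rules, clipping, hierarchical tiles and SIMD, but the inner loop stays this
// simple -- which is exactly why it vectorizes so well.
void rasterizeTriangle(Vec2 v0, Vec2 v1, Vec2 v2, std::uint32_t color,
                       std::vector<std::uint32_t>& framebuffer,
                       int width, int height)
{
    int minX = std::max(0, static_cast<int>(std::min({v0.x, v1.x, v2.x})));
    int minY = std::max(0, static_cast<int>(std::min({v0.y, v1.y, v2.y})));
    int maxX = std::min(width  - 1, static_cast<int>(std::max({v0.x, v1.x, v2.x})));
    int maxY = std::min(height - 1, static_cast<int>(std::max({v0.y, v1.y, v2.y})));

    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            Vec2 p{x + 0.5f, y + 0.5f};                 // sample at pixel centre
            float w0 = edgeFunction(v1, v2, p);
            float w1 = edgeFunction(v2, v0, p);
            float w2 = edgeFunction(v0, v1, p);
            if (w0 >= 0.0f && w1 >= 0.0f && w2 >= 0.0f)  // inside all three edges
                framebuffer[static_cast<std::size_t>(y) * width + x] = color;
        }
    }
}
```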
 
Power is going to be dominated by data movement. NVIDIA estimates that on the 10 nm node, accessing a fixed-function unit which consumes virtually no power itself but is 10 mm away from the cores requires the same amount of energy as 23 double-precision FMA operations. It used to be the equivalent of only 6 of those.

With the right instruction set, you can really do a lot with 23 operations. Also note that the alternative 'solution' of bringing the fixed-function units closer to the cores would really just be an intermediate step toward bringing them into the cores as part of the instruction execution units.

So power consumption really isn't going to stop the convergence between CPU and GPU architectures.

This is exactly the problem that large core processors face. You end up shuffling various bits of data from one end of the core to another, particularly when you add in things like OOO, bypassing, etc. Also, a large instruction set means more hardware, which means the data has to travel further to reach the correct execution unit for a given instruction.
 
Power is going to be dominated by data movement. NVIDIA estimates that on the 10 nm node, accessing a fixed-function unit which consumes virtually no power itself but is 10 mm away from the cores requires the same amount of energy as 23 double-precision FMA operations. It used to be the equivalent of only 6 of those.
Don't forget that you'd need to move data into and out of the registers as well. That power's also given in the document you linked.

This is exactly the problem that large core processors face. You end up shuffling various bits of data from one end of the core to another, particularly when you add in things like OOO, bypassing, etc. Also, a large instruction set means more hardware, which means the data has to travel further to reach the correct execution unit for a given instruction.

Even today's parallel processors make an effort to keep data as local as possible; for example in Kepler (and, even though it might not be a good example wrt power, Fermi), the SMX/SM owns a thread after it has been assigned to it until it's retired. That should theoretically save a lot of data movement.
 
The increasing programmability of streaming processors was primarily driven by the ability to maintain Moore's law. However, this is undoubtedly going to end very soon. Even Intel admits that this is going to happen within 5-7 years.

The mentioned limitations of power density due to the required data movement are an argument for fixed function rather than against it, simply because that's the nature of specialized logic blocks.

Why is the fixed function rasterizer still in use? Because you cannot compete with it using general streaming processors. A lot of smart people tried at Intel, and they realized that you can't get even close to fixed function, even with the hope of compensating with better overall utilization, especially because of the reduced data locality.
 
Power is going to be dominated by data movement. NVIDIA estimates that on the 10 nm node, accessing a fixed-function unit which consumes virtually no power itself but is 10 mm away from the cores requires the same amount of energy as 23 double-precision FMA operations. It used to be the equivalent of only 6 of those.

10mm away? Perhaps use a different layout team?

Seriously though, as an example, performing 1080p video decode with dedicated HW takes ~= 12mW. By comparison, software decoders (VLC | Media Player) on a reasonably recent i7 appeared** to be drawing approximately 1 to 2W for 1080p.

Further, James McCombe from Caustic recently stated that dedicated hardware for a particular task was (IIRC) around 40x smaller than using programmable logic. If you think that dedicated hardware will be phased out soon, you may be mistaken.

**Estimated using http://software.intel.com/en-us/articles/intel-power-gadget-20
 
Bandwidth will always be a problem, but I think the latest FPGAs demonstrate that internal and external data bandwidth can be increased dramatically with modern IC manufacturing technologies.
Please elaborate. What FPGA technology will save ASICs from running into the bandwidth wall?

I'm sure we'll still see "dramatic" increases in external bandwidth in the next several years. But it just won't be enough in comparison to the even more dramatic increase in computing power. Indeed this has to be addressed by increasing internal bandwidth. This requires big caches, just like a CPU, and it requires high data access locality by having a low number of threads, also just like a CPU.
The slowed pace of performance/price scaling for IC manufacturing is already an established fact of reality right now, and the slowdown will naturally grow exponentially with time. What Intel says are just sweet, unspecific words to keep investors and their personal pride calm.
What exactly do you mean by performance/price scaling? Are you talking about effective performance for the consumer, or the fundamental performance of the semiconductor process? And are you talking about the cost to the consumer, or the cost to build a fab for a smaller process node?

For GPUs the increase in effective performance has indeed slowed down. But that's not because of a fundamental issue with process scaling, that's a consequence of hitting several design ceilings that they didn't have to deal with before. For a long time they were able to double the performance with every generation by using a bigger die with a wider bus, higher frequencies, consuming more power, and catching up on process technology. Bigger dies didn't affect price too much because of the economies of scale. But lately things have become a lot tougher for them because they've played all their aces. TDP can no longer increase for a given form factor, they're already using the top process technology, there's little room for improvement left in GDDR technology without running into power issues, etc. Progress is still possible, but it's more constrained. Basically they're now playing by the same rules as everyone else.

CPUs on the other hand still have a lot of potential for growth in throughput computing, beyond Moore's law. If Skylake supports AVX-512, the peak FP performance per core will have improved by a factor of 16 in about a decade's time. The entire core size didn't increase by nearly as much, so there will be room for 8 such cores on a mainstream chip. So they're becoming faster at a faster pace than GPUs now. Eventually they'll also just be limited by bandwidth. But that really just means that they can be unified, with specialized functions replacing the need for power efficient fixed-function units.
Why would you want to stall a gigantic serial processing pipeline with massive parallel streaming/processing? Right! It makes no sense at all.
What makes you think it would stall more? Graphics processing is no longer the "gigantic serial processing pipeline" that it used to be. GPUs have to deal with a lot of irregular behavior, and can stall or not achieve peak throughput for numerous reasons.
There is already barely any cost advantage per transistor from the 32/22 nm step.
Every new process node is initially more expensive than the previous one. But eventually they all come down in cost. 22 nm was particularly expensive due to the radically new 3D transistor technology. But it's offering great advantages and they're so confident in its abilities that they're making a leap of a node and a half to 14 nm. This also helps compensate for the increased cost per wafer.

10 nm will be tackled without EUV, and will likely introduce 450 mm wafers to reduce the cost per chip. EUV will then probably increase the cost again, but allows scaling down to 7 and 5 nm. Beyond that things get foggy, but it's been like that in the previous decade as well, so it's likely for new technology to pop up which allows further scaling. More importantly, I'm pretty confident that we won't have to wait that long for the CPU and GPU to unify. Again, 2 TFLOPS is theoretically feasible for a unified device at 14 nm. That's a lot of power for a tiny mainstream chip, and sufficient for the majority of the market that currently makes do with integrated graphics.
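
For what it's worth, the 2 TFLOPS figure is easy to sanity check with peak-rate arithmetic. The core count, clock and two-FMA-port assumption below are my own illustrative guesses for a hypothetical mainstream AVX-512 part, not a product spec:

```cpp
#include <cstdio>

int main() {
    // Illustrative assumptions for a hypothetical mainstream AVX-512 CPU.
    const double lanes_per_vector = 512.0 / 32.0; // 16 single-precision lanes
    const double flops_per_fma    = 2.0;          // multiply + add
    const double fma_units        = 2.0;          // assumed two FMA ports per core
    const double cores            = 8.0;
    const double clock_ghz        = 4.0;

    const double gflops = lanes_per_vector * flops_per_fma * fma_units
                        * cores * clock_ghz;      // GFLOPS, since clock is in GHz
    std::printf("peak: %.0f GFLOPS (~%.1f TFLOPS)\n", gflops, gflops / 1000.0);
}
```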
 
You should stop thinking about the classical OGL/D3D rendering pipeline.
That's exactly what I'm doing! I'm claiming that graphics will become just like any other software, with total freedom. I'm thinking way beyond the classical rendering APIs.
Computer graphics rendering is no magic that needs endless iterations of new approaches.
Now you're just being blatantly ignorant. A lot of new graphics innovation is happening, and it's all essentially on the software side. When GPUs became programmable, developers ran with it and pushed it to its limits in ways the hardware manufacturers never foresaw. We're now at Shader Model 5.1, and it still isn't flexible enough. Developers have previously demanded direct access to integer texels to do their own filtering, they've made MSAA hardware redundant with shader-based approaches, and now they're trying hard to make rasterization programmable. And that's just rasterization graphics. Ray-tracing and general-purpose throughput computing also continuously diversify the workloads and demand more programmability instead of fixed-function hardware.
The list of all rendering problems can be put on a single page and the list of unique concepts to solve them can be condensed to a couple of lines.
Sure, there's only a handful of physics equations that define the behavior of light. So devising 'a' theoretical solution isn't that hard. It's putting it into practice that is the real challenge. Many new algorithms are developed every year, and integrating them into actual products takes a lot of refinement and tweaking.
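
For reference, that handful essentially reduces to the rendering equation; shadows, GI and physically based shading are all approximations of it under real time constraints:

```latex
L_o(x, \omega_o) = L_e(x, \omega_o)
  + \int_{\Omega} f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (n \cdot \omega_i)\, \mathrm{d}\omega_i
```
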
Btw.: Tim Sweeney should STFU and realize that he never published/introduced a novel solution for interactive graphics.
Again, devising a theoretical solution is only a fraction of the work. It takes real craftsmanship and ingenuity to put it into practice to power dozens of commercially successful products capable of running on a wide range of hardware.
He says so many obviously wrong things that my eyes are bleeding.
Such as?
 
That's exactly what I'm doing! I'm claiming that graphics will become just like any other software, with total freedom. I'm thinking way beyond the classical rendering APIs.
You just deny the reality that nothing is for free.
That you even claim that CPU scaling outperforms GPU scaling is ... delusional. The practical efficiency (and your proclaimed "freedom") of AVX units is just laughable compared to GPU streaming processors.
You think you can do better graphics with a CPU or a generic streaming processor compared to a GPU with an identical power budget? Then why don't you do it? Everything is available for that. But hurry, it will be much harder in the future.
 
Great job indeed!
As expected, performance/Watt is an order of magnitude away from a GPU. But still better than I thought it would be, to be honest.
Note that it's only using up to SSE4 for now.

Jan Vlietinck wrote an AVX2 software renderer for Quake which is faster than the integrated GPU. The most amazing part of that is that it consists mostly of texture sampling, for which the GPU has fixed-function hardware.
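
To give a feel for what "texture sampling in software" involves, here's a scalar sketch of bilinear filtering over a simple grayscale texture. An AVX2 renderer would do the same address math and blending on eight pixels at a time; this is my own simplified illustration, not Jan Vlietinck's code:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal grayscale texture with wrap-around addressing.
struct Texture {
    int width = 0, height = 0;
    std::vector<float> texels;   // one float per texel

    float fetch(int x, int y) const {
        x = ((x % width)  + width)  % width;    // wrap addressing mode
        y = ((y % height) + height) % height;
        return texels[static_cast<std::size_t>(y) * width + x];
    }
};

// Bilinear filtering: fetch the 2x2 neighbourhood around (u, v) and blend with
// the fractional coordinates. This address math plus the blend is exactly the
// work GPU texture units perform in fixed function.
float sampleBilinear(const Texture& tex, float u, float v) {
    float x  = u * static_cast<float>(tex.width)  - 0.5f;
    float y  = v * static_cast<float>(tex.height) - 0.5f;
    int   x0 = static_cast<int>(std::floor(x));
    int   y0 = static_cast<int>(std::floor(y));
    float fx = x - static_cast<float>(x0);
    float fy = y - static_cast<float>(y0);

    float t00 = tex.fetch(x0,     y0);
    float t10 = tex.fetch(x0 + 1, y0);
    float t01 = tex.fetch(x0,     y0 + 1);
    float t11 = tex.fetch(x0 + 1, y0 + 1);

    float top    = t00 + (t10 - t00) * fx;
    float bottom = t01 + (t11 - t01) * fx;
    return top + (bottom - top) * fy;
}
```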
 