Software/CPU-based 3D Rendering

Jan Vlietinck wrote an AVX2 software renderer for Quake which is faster than the integrated GPU. The most amazing part is that the workload consists mostly of texture sampling, for which the GPU has fixed-function hardware.

Yeah, for Quake 1 at high resolutions. That's a lot of very simple shading on a small amount of geometry yielding polygons with a huge number of pixels. Not exactly a great case study for software rendering with modern games. There's also no guarantee that the DX10 engine is as sophisticated as the software one.

Given the pixel-to-polygon ratio, his overdraw elimination is very cheap compared to what you gain, so it's not hard to see that it'll win when you're bandwidth limited. And while the results are impressive with bilinear filtering, that's still a far cry from everything a full-featured hardware TMU would do.
 
The more fixed function units you have, the more potential bottlenecks you have in your hardware, making it harder and harder to fully utilize all the transistors all the time.
Power (density) concerns already put a limit on how many transistors you can use at a time. In principle you could begin the design process by throwing in as many general purpose cores (initially without any specialised instructions) as fit your power budget, then add specialised instructions or more complex units for any operation you come across which is reasonably common and where specialised hardware can save more power than the cost of moving data to and from that unit.

Texture fetch is an interesting example because dedicated hardware can potentially reduce data movement across the chip.
 
This is exactly the problem that large core processors face. You end up shuffling various bits of data from one end of the core to another, particularly when you add in things like OoO, bypassing, etc. Also, a large instruction set means more hardware, which means the data has to travel further to reach the correct execution unit for a given instruction.
CPU cores are indeed adding more functionality every generation, but they're not getting larger in absolute terms, thanks to process shrinks. Multi-core is an obvious consequence of that. So the distance the data has to travel is really getting smaller, which saves power. Even ultra-mobile cores now use out-of-order execution, which shows that it can be both area and power efficient.

The layout of a modern CPU core also ensures that the data doesn't have to travel "from one end of the core to another". At least not within one clock cycle. The SIMD units are even split into lanes and it 'costs' an additional cycle for cross-lane operations. Which really is a small price to pay for the power efficiency benefit. And the same things apply to GPUs. A GK104 does not consist of 1536 tiny scalar cores. It consists of 8 large cores with 6 vector FMA units of 32 elements each. But it's quite power efficient because the biggest volume of data, namely the operands, only has to travel a fraction of the distance of the entire core every cycle. Still, NVIDIA concludes that "fetching operands costs more than computing on them". Hence the Echelon research project suggests using a tiny operand register file to keep the most recently used data very close to the ALUs, which essentially serves the same purpose as a bypass network on a CPU.

So both the CPU and GPU have to play a balancing act, and they start to look increasingly alike. Other research has also concluded that the difference between "large" and "small" cores is very minor, because you're just looking at it from a different level. With CPUs increasing their SIMD width, they start to look a lot like a GPU with many tiny and efficient 'cores' inside of them. Note that the out-of-order execution logic is not affected by this. It just gets cheaper with every process shrink.
 
Don't forget that you'd need to move data into and out of the registers as well. That power's also given in the document you linked.
First of all, that was just an example 'for illustration purposes'. Obviously a fixed-function unit doesn't consume zero power either so I simplified it for both ends. The important observation is that moving data is increasingly more costly than computing on it. This wasn't the case before, so it's a paradigm shift that is bound to have vast consequences. It means that if you have to move data into a core anyway, you should keep it there as long as possible (which can include a few levels of cache). You can execute several more instructions for the cost of moving data out of the core and back in again, even when taking register accesses into account. And that number will continue to increase.

Secondly, the article concludes that the cost of the register file access can be significantly reduced by having a tiny operand register file. CPUs get most of their operands from a bypass network, which in theory is even more power efficient so it can be mostly neglected when comparing programmable versus fixed-function logic.
Even today's parallel processors make an effort to keep data as local as possible. For example, in Kepler (and, even though it might not be a good example with respect to power, Fermi), the SMX/SM owns a thread after it has been assigned to it until it's retired. That should theoretically save a lot of data movement.
Has any GPU ever been capable of thread migration (aside from Larrabee)?

Anyway, while pinning threads to cores benefits data locality, it can complicate load balancing. The typical answer to that is to just have more granularity and juggle many threads per core, but that actually worsens data locality. GPUs not only run many threads, which generates an irregular data access pattern, they also have tiny caches. So the cache hit rate is relatively poor, which in turn means that data has to be fetched from a more distant location, which will soon become a dominant factor in power consumption...

It's not that this is an insurmountable problem, but each solution involves converging closer to a CPU architecture.
 
The increasing programmability of streaming processors was primarily driven by the ability to maintain Moore's law.
Not really. It's a prerequisite, but not a driver. The hardware-related drivers behind it were bandwidth and power consumption. With fixed-function graphics chips, multi-pass techniques were used to create more advanced effects than what single- or dual-texturing could produce. They could have scaled it to a thousand pixel pipelines by now, if it wasn't for the extreme RAM bandwidth this would have required. It's obviously far more efficient to have an on-die register file to store the temporary results of each arithmetic operation, than to write it to RAM and read it back later.

So in a sense programmability was always there. It just wasn't practical to have many passes. Also note that despite introducing 'real' programmability, Shader Model 1.0 was insanely limited, with just 2 temporary registers and a maximum of 4 texture lookups. So developers still resorted to multi-pass, only to run into bandwidth issues again. Every major advance in the "programmability" of graphics hardware, which is otherwise a pretty vague term in and of itself, has since introduced better control over data locality. Possibly the next step with DirectX 12 is to have reconfigurable shader stages, in another attempt to keep more data on-die instead of streaming it out to RAM and reading it back. To me this looks as primitive as Shader Model 1.0 again, except at the task level instead of the instruction level. Eventually this is bound to evolve into giving the developers complete access to things like task subdivision and scheduling, just like in a software renderer.
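To make the bandwidth argument concrete, here's a deliberately simplified sketch (hypothetical C++ loops, purely for illustration): with multi-pass, the intermediate result of the first pass makes a round trip through the framebuffer in RAM, while a programmable single pass keeps it in a register and writes the framebuffer exactly once.

#include <cstddef>
#include <vector>

// Multi-pass (fixed-function era): the intermediate result goes out to the
// framebuffer in RAM and is read back, costing extra memory traffic.
void multiPass(std::vector<float>& framebuffer,
               const std::vector<float>& tex0,
               const std::vector<float>& tex1)
{
    for (std::size_t i = 0; i < framebuffer.size(); i++)
        framebuffer[i] = tex0[i];                    // pass 1: write to RAM
    for (std::size_t i = 0; i < framebuffer.size(); i++)
        framebuffer[i] = framebuffer[i] * tex1[i];   // pass 2: read back, modulate, write again
}

// Programmable shading: the intermediate stays in a register.
void singlePass(std::vector<float>& framebuffer,
                const std::vector<float>& tex0,
                const std::vector<float>& tex1)
{
    for (std::size_t i = 0; i < framebuffer.size(); i++)
    {
        float temp = tex0[i];                        // the 'temporary register'
        framebuffer[i] = temp * tex1[i];             // single write to RAM
    }
}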
However this is undoubtedly going to end very soon. Even Intel admits that this is going to happen within 5-7 years.
Please point me to a senior Intel employee stating this.

Anyway, I think it's critical to note that the end of Moore's law doesn't equal the end of all semiconductor scaling. The 'law' simply states that the transistor count doubles every two years, but if this no longer holds then it might still double every three years, for an extensive period of time. Also, you don't necessarily need more transistors for more performance. Clock speed might go up with advanced materials, using the same minimum feature size. Or power might be reduced, which allows more transistors to be active. Or new design methodologies are developed which offer similar benefits. Stacking of dies is also a way to keep increasing the transistor count, and can bring logic and/or storage closer together to save power. Or we might use optics or something else which changes everything we thought we knew. Researchers have proven several times before that they won't throw in the towel easily just because they're hitting some physical limits. I find it mind-boggling that Intel will still manufacture the 10 nm node using 193 nm lithography. Either way, when the billions of dollars stop being poured into process technology, they will become available for other research. That's a lot of dough to find the next holy grail. There are lots of things left to be explored, and probably a ton of things yet to be discovered. So rest assured that steady progress will continue for a long time to come.

And as I've said multiple times now, we'll reach the point where CPU-GPU unification makes sense, well before the point where Moore's law no longer holds 100%. Having an integrated GPU adds cost, and if the CPU does an adequate job at graphics then that cost will be eliminated. If AVX-512 comes to Skylake then I don't think that time is much further away.
Why is the fixed function rasterizer still in use? Because you cannot compete by using general streaming processors. A lot of smart people at Intel tried it, and they realized that you can't even get close to fixed function, even with the hope of being able to compensate with better overall utilization, especially because of the reduced data locality.
Ah, yes, that must also be why we still use fixed-function pixel processing. No, wait...

Fixed-function is fixed-function. Whenever you want to do something beyond its capabilities, programmable logic quickly wins. This is happening to rasterization too. Researchers are proposing new algorithms every year which require more flexible rasterization. Even NVIDIA considers it an important research field. Also, you're grasping at straws when it comes to the performance impact of this. Rasterization takes only a fraction of computing power compared to what is required for shading. And then there's still the opportunity to add a handful of instructions to speed it up and make it more power efficient.
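For a sense of scale: a naive half-space (edge function) rasterizer is only a handful of multiply-adds and compares per pixel. This is just an illustrative C++ sketch, not how a production rasterizer is written (those add fixed-point snapping, fill rules and tile traversal), but it shows why rasterization itself is a small fraction of the work.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };

// Signed area term: which side of edge a->b the point c lies on.
static float edge(const Vec2& a, const Vec2& b, const Vec2& c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Naive half-space rasterizer. 'pixels' is assumed to hold width * height entries.
void rasterize(std::vector<uint32_t>& pixels, int width, int height,
               Vec2 v0, Vec2 v1, Vec2 v2, uint32_t color)
{
    int minX = std::max(0, (int)std::min({v0.x, v1.x, v2.x}));
    int maxX = std::min(width - 1, (int)std::max({v0.x, v1.x, v2.x}));
    int minY = std::max(0, (int)std::min({v0.y, v1.y, v2.y}));
    int maxY = std::min(height - 1, (int)std::max({v0.y, v1.y, v2.y}));

    for (int y = minY; y <= maxY; y++)
    {
        for (int x = minX; x <= maxX; x++)
        {
            Vec2 p = {x + 0.5f, y + 0.5f};
            float w0 = edge(v1, v2, p);
            float w1 = edge(v2, v0, p);
            float w2 = edge(v0, v1, p);
            // Inside test; assumes consistent winding (flip the test for the opposite winding).
            if (w0 >= 0 && w1 >= 0 && w2 >= 0)
                pixels[y * width + x] = color;
        }
    }
}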
 
10mm away? Perhaps use a different layout team?
It wasn't intended as a realistic example. It just illustrates that data locality is becoming critical, and processing instructions is getting really cheap.
Seriously though, as an example, performing 1080p video decode with dedicated hardware takes roughly 12 mW. By comparison, software decoders (VLC, Media Player) on a reasonably recent i7 appeared to be drawing approximately 1 to 2 W for 1080p.
That would be a great example if not for the fact that GPUs now already mostly consist of programmable logic, which isn't going to diminish. The thing is, graphics is not like video decoding. It is far more diverse. And even that 1080p decoder you're talking about is utterly worthless when presented with a new codec. Instead the CPU can handle any codec, but it will also let you run your O.S., browse the web, play a game with advanced physics and A.I., predict the weather, etc. So the value of something should be determined over the entire range of things it supports. Else fixed-function hardware would always win and the very idea of having a CPU would be insane. But 1-2 Watt is perfectly acceptable for the markets this CPU targets. For the same reasons practically nobody buys a dedicated sound card any more. It costs more than what it offers in added value. GPUs are not very different any more. You could in theory design a much more dedicated chip which supports only the algorithms of one particular game (e.g. with a pipeline for each shader, optimized down to the last bit). But obviously that's not worth it. And besides, it would still just be bandwidth limited. So only some power efficiency advantage would remain, but instead we'd rather get hardware that runs a little hotter but supports many more applications.

It's only a matter of time before people no longer value an integrated GPU more than what it costs them, and perceive it as a limitation instead.
Further, James McCombe from Caustic recently stated that dedicated hardware for a particular task was (IIRC) around 40x smaller than using programmable logic. If you think that dedicated hardware will be phased out soon, you may be mistaken.
Sure, "for a particular task" you can get results like that. But again you have to look at the much bigger picture to evaluate the cost vs. value to see whether we can expect it in consumer hardware any time soon. Also note that some new x86 instructions are dozens of times faster than their previous alternative instruction sequences, but when used in complete applications the speedup is far more modest. Still, it shows that the same benefits of fixed-function hardware can be achieved using new instructions, in an architecture that will also let you run your O.S., browse the web, play a game...
 
You just deny the reality that nothing is for free.
I'm not denying that at all. I am extremely aware of the cost. But unifying the GPU's vertex and pixel processing also had a cost, and yet it still happened. Note that vertex and pixel processing started out completely differently, not even using a single chip. In comparison the CPU and GPU are much closer already (both physically and in terms of overlap between the kind of processing they can do). So while their unification won't come for free either, I still strongly believe it's going to happen.
That you even claim that CPU scaling outperforms GPU scaling is ... delusional.
Why? It's starting to become a pretty close race. GPUs can't keep up their historic scaling. They're largely at the mercy of process technology to be able to increase performance. And even then, effective performance won't scale as much as theoretical performance, due to the bandwidth wall. CPUs on the other hand have scaled from 128-bit to 256-bit SIMD, with FMA, and later 512-bit, all with a very minor increase in core transistor count, thus allowing the core count itself to go up as well. The GPU has been pushing the limits of compute density long before, and can't scale it as aggressively any more. The CPU's RAM bandwidth can still double with DDR4, which is also something that would be hard to imagine for the GPU (until they move RAM on-package, but that's already available for the CPU).
The practical efficiency (and your proclaimed "freedom") of AVX units is just laughable compared to GPU streaming processors.
The freedom that I am proclaiming stems from the CPU's ability to support any graphics feature, past or future. In contrast it is simply impossible for a Shader Model 4.0 GPU to perform the new Shader Model 5.0 functionality (without involving the CPU, of course). The CPU can do it reasonably efficiently, plus anything else you can imagine. In fact, given the computing resources, the performance can be quite exceptional in my opinion. With just 100 GFLOPS worth of computing power, you can compete with a 50 GFLOPS GPU. Most people would expect it to be far worse due to having to 'emulate' things. But with dynamic code generation and specialization the code becomes very streamlined to perform only exactly the operations required. So an advanced software renderer is really a graphics library implementation rather than an emulator. And with CPUs being hardware, and shaders being software, where's even the distinction between hardware and software rendering? It is fading rapidly.
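To make the specialization point concrete, here's a compile-time analogue in C++ (an advanced software renderer does the equivalent at run time with dynamic code generation; the names here are made up for illustration). The generic path tests the sampler state for every sample, while the specialized instantiation compiles down to only the operations a particular shader actually needs.

// Requires C++17 for 'if constexpr'.
struct SamplerState
{
    bool sRGB;        // convert from gamma space?
    bool alphaTest;   // discard below a threshold?
};

// Generic path: tests the state per sample, like an interpreter would.
float sampleGeneric(float texel, float alpha, const SamplerState& s)
{
    float value = texel;
    if (s.sRGB)
        value = value * value;          // crude gamma-to-linear approximation
    if (s.alphaTest && alpha < 0.5f)
        value = 0.0f;
    return value;
}

// Specialized path: the state is baked in, the unused branches disappear at compile time.
template<bool SRGB, bool ALPHA_TEST>
float sampleSpecialized(float texel, float alpha)
{
    float value = texel;
    if constexpr (SRGB)
        value = value * value;
    if constexpr (ALPHA_TEST)
        if (alpha < 0.5f)
            value = 0.0f;
    return value;
}

// A shader that uses neither feature costs just the load and return:
// sampleSpecialized<false, false>(texel, alpha);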

I don't see what you consider laughable about AVX, unless you're talking about the way AVX1 was implemented on Sandy Bridge. It was severely bottlenecked by L/S bandwidth. But this was rectified with Haswell. It also added FMA support, which used to be pretty GPU exclusive, and 256-bit integer operations. AVX2 didn't get an efficient gather implementation though, but the optimization manual already promises improvements in future hardware (which possibly also supports AVX-512). So really the efficiency of AVX is catching up with the GPU. If you find it laughable now, you won't be laughing for much longer. Both architectures have a few pros and cons, but they mostly average out and there is continued convergence taking place.
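For illustration, this is roughly what the AVX2/FMA combination buys you (a sketch only, assuming Haswell or later and compilation with -mavx2 -mfma; as noted above, gather throughput on Haswell itself is still modest): eight table lookups plus a fused multiply-add in a handful of instructions.

#include <immintrin.h>

// Eight parallel lookups into a float table, blended with a weight:
// result[i] = table[index[i]] * weight[i] + bias[i]
// Assumes AVX2 and FMA; indices must be in range.
void gatherFma(float* result, const float* table,
               const int* index, const float* weight, const float* bias)
{
    __m256i vindex  = _mm256_loadu_si256((const __m256i*)index);
    __m256  vweight = _mm256_loadu_ps(weight);
    __m256  vbias   = _mm256_loadu_ps(bias);

    // One gather instruction replaces eight scalar loads.
    __m256 vtable = _mm256_i32gather_ps(table, vindex, 4);

    // Fused multiply-add: a * b + c in a single instruction.
    __m256 vresult = _mm256_fmadd_ps(vtable, vweight, vbias);

    _mm256_storeu_ps(result, vresult);
}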
You think you can do better graphics with a CPU or a generic streaming processor compared to a GPU with an identical power budget? Then why don't you do it? Everything is available for that. But hurry, it will be much harder in the future.
Depending on how you define it, the CPU can definitely do "better graphics" than the GPU. Also, the best performance improvement is the transition from the nonworking state to the working state, which is where software rendering currently has the most commercial success.

Why do you think it will be much harder in the future?
 
Somehow reminds me of RISC vs. CISC back then: having simple, generic units/instructions instead of tons of specialized instructions :).
Nowadays Intel and AMD have CISC op-codes but internally moved to more RISC-like cores, and with more and more specialized instructions (gather, fma, blendps...) it swings back towards a more CISC way. While on the other side, e.g. ARM created Thumb, which looks somewhat more CISC.

I think the same will continue with CPUs vs GPUs. They want to be power efficient, of course. But they don't want to be religious about "lets keep it all fixed function" or "everything has to be software", as both sides can be power inefficient. Larrabee somehow proved that you can't be that crazy.
Although I agree that the GPU and CPU will fuse together, I think more about that endless amount of vector units. The CPU will still keep its OoO logic, the op-code decoder and all the other front-end. The GPU will keep the 'fetch' units and the rasterizer.
It's of course a waste of resources to have a rasterizer that is maybe utilized 10% overall, but at the same time it's a waste to occupy 10% of the vector units doing rasterization if a simple hard-wired rasterizer would solve it.

I think we've seen with NVIDIA's TMUs, which seem to have moved the perspective division to the vector units, how it's going to continue. Fixed units will be stripped to the bare minimum that is still 10x better in fixed function than in software, while some stuff moves to the vector units. In terms of the rasterizer, I could imagine the rough rasterization of pixel blocks staying on the fixed-function side, while some 'early-z' fetch + fine rasterization might be moved to the vector units that would otherwise maybe stall, waiting for visible pixels. Also the ordering of pixels is probably handled way more efficiently in hardware than in software.

I think it's power-inefficient to remove fixed functions and at the same time add special instructions. Even if those instructions are just supportive and would be way smaller than the fixed unit, they would be added to every vector unit, wasting transistors by not being used most of the time.

Also, fixed functions are not only the bare algorithm put into transistors; on GPUs they usually have nifty tricks to hide latency and to queue and balance workloads, e.g. compression for MSAA and splitting subsamples into individual memory layers while doing gamma-correct blending.
 
I don't see the claimed movement towards fixed function hardware. In fact quite the opposite seems to be true, when you compare DX9 hardware to DX11 hardware. Programmability has increased dramatically and many fixed function units have been replaced by more general purpose units.

- We now have unified shaders instead of separate vertex and pixel shaders.
- Registers and ALU (math) are now IEEE-compatible 32-bit floats (special-cased 16/24-bit float processing is a thing of the past). ALU is cheap compared to bandwidth.
- Custom texture caches and constant caches have been replaced with general purpose L1 caches (backed up with general purpose L2 caches).
- Lots of custom internal data buffers have been replaced with a single internal general purpose shared memory ("LDS"). As data caches/buffers tend to be quite big (lots of transistors), it's a waste to have a separate custom one for each fixed function unit (because things like geometry shaders and tessellation are not always active, and not all vertex shaders output maximum amount of interpolants, etc).
- All the listed improvements have allowed cost effective way to introduce new compute shader functionality. The general purpose on-chip shared memory allows efficient data sharing and synchronization between multiple threads of the same compute unit (data can be kept close to the execution units to reduce latency and to reduce memory traffic = reduce power usage). Unified shader cores have flexible memory input/output paths (previously vertex shaders were scatter only, and pixel shaders were gather only), filling the other requirement for general purpose processing using shader cores.
- Geometry amplification is possible with geometry shaders and tessellation. DX11 compatible tessellation uses unified programmable shader cores (just like all other shader types). Previous fixed function tessellators never became popular because of limited flexibility. Geometry amplification allows many new algorithms to be implemented, and it allows data to be kept in on-chip shared memory (instead of multipassing, thus reducing lots of data movement = reducing energy usage).
- GPUs can now spawn threads and control draw call primitive counts themselves (by indirect draw/dispatch calls). This creates lots of new possibilities for programmers. Kepler also can spawn new kernels from GPU (without CPU assistance).

We have also seen failures of fixed function hardware. Free 4xMSAA required lots of transistors on Xbox 360 (and lots of internal EDRAM bandwidth). A few years after the console launch deferred rendering was invented, and pretty much nobody uses MSAA hardware anymore in their console games. These transistors are just idling there doing nothing... That's always the risk of fixed function hardware. If it doesn't suit the task, it will be just dead silicon.
Texture fetch is an interesting example because dedicated hardware can potentially reduce data movement across the chip.
Agreed, fixed function texture filtering reduces traffic from L1 cache to registers. However it doesn't reduce traffic from memory to L2 to L1 (as all texels need to be in L1 for filtering), and that's the biggest distance the data needs to move (and thus consumes most of the data traffic energy).

For compressed (or 8888/11f-11f-10f) data, the savings in register bandwidth are not always that clear. After filtering and sRGB conversion, the fixed function unit must send a 4x32f value through the internal link (as one texel might be 234 and the next one 235 and it might be zoomed so that one pixel covers the whole screen, and we still need a smooth gradient = we need lots of bits of precision). In this case the fixed function unit loses the point sampled 8888 case by 4x, and ties the 8888 bilinear filtered case. The fixed function unit wins the trilinear case by 2x (and the anisotropic case by a larger margin). The BC7 compressed case is harder to analyze. It favors the fixed function hardware, except in cases where the CPU implementation is allowed to branch (a 3x3 area of each 4x4 block needs just one 128 bit register load). However CPU performance completely craps out if you add those incoherent branches to the code, so this task should favor fixed function hardware if we only consider the L1->register data movement (and even more so for trilinear and anisotropic filtering).
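For reference, here's roughly what the software side of that comparison looks like: a scalar bilinear fetch of an 8888 texture in C++ (a simplified sketch, border handling omitted; a real implementation would be vectorized). Four packed 32-bit texels are read from L1 per sample, the same 128 bits a fixed function unit would forward after filtering, and the filtering arithmetic itself happens in registers.

#include <cstdint>

struct Color { float r, g, b, a; };

// Unpack an RGBA8888 texel to full-precision floats in registers.
static Color unpack(uint32_t texel)
{
    return { ( texel        & 0xFF) / 255.0f,
             ((texel >> 8)  & 0xFF) / 255.0f,
             ((texel >> 16) & 0xFF) / 255.0f,
             ((texel >> 24) & 0xFF) / 255.0f };
}

// Scalar bilinear filter; assumes u, v well inside (0,1), wrap addressing, no mipmapping.
Color bilinear(const uint32_t* texture, int width, int height, float u, float v)
{
    float x = u * width - 0.5f, y = v * height - 0.5f;
    int x0 = (int)x, y0 = (int)y;
    float fx = x - x0, fy = y - y0;
    int x1 = (x0 + 1) % width, y1 = (y0 + 1) % height;

    // Four packed 32-bit texels loaded from the cache.
    Color c00 = unpack(texture[y0 * width + x0]);
    Color c10 = unpack(texture[y0 * width + x1]);
    Color c01 = unpack(texture[y1 * width + x0]);
    Color c11 = unpack(texture[y1 * width + x1]);

    auto lerp = [](float a, float b, float t) { return a + (b - a) * t; };
    Color top = { lerp(c00.r, c10.r, fx), lerp(c00.g, c10.g, fx),
                  lerp(c00.b, c10.b, fx), lerp(c00.a, c10.a, fx) };
    Color bot = { lerp(c01.r, c11.r, fx), lerp(c01.g, c11.g, fx),
                  lerp(c01.b, c11.b, fx), lerp(c01.a, c11.a, fx) };
    return { lerp(top.r, bot.r, fy), lerp(top.g, bot.g, fy),
             lerp(top.b, bot.b, fy), lerp(top.a, bot.a, fy) };
}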

However the fixed function unit too needs to load its texels from the same general purpose L1 cache (no current GPU has special purpose texture caches anymore). In order to have any gains in data movement energy efficiency, the fixed function unit needs to be closer to the L1 cache than the general purpose register files. If the L1 is large, this might be problematic (unless you bank it, and replicate the fixed function units for each bank... adding even more dark silicon).

The anisotropic filtering case is quite hard for a CPU to solve efficiently. It must adapt the texel count (and access pattern) rapidly based on the surface slope. But branch mispredicts are in general around 20 cycles, and these kinds of scenarios are very hard to predict properly. The new Haswell gather instruction has a read mask (to skip some lane loads without branching), but I don't know how efficient it is compared to standard register loads.
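On the gather read mask: per Intel's intrinsics it's exposed as _mm256_mask_i32gather_ps, where masked-off lanes keep a fallback value instead of loading. A hypothetical usage sketch (assuming AVX2) for skipping taps without a per-lane branch:

#include <immintrin.h>

// Gather up to eight texels, but only for the lanes selected by 'mask'.
// Masked-off lanes keep the value from 'fallback' and issue no load.
__m256 maskedFetch(const float* texels, __m256i offsets, __m256 mask, __m256 fallback)
{
    // The high bit of each 32-bit mask element selects whether that lane loads.
    return _mm256_mask_i32gather_ps(fallback, texels, offsets, mask, 4);
}

// Example: build a mask that enables only the first 'count' lanes (e.g. the tap count).
__m256 lanesBelow(int count)
{
    __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i live = _mm256_cmpgt_epi32(_mm256_set1_epi32(count), lane);  // lane < count
    return _mm256_castsi256_ps(live);                                   // all-ones => sign bit set
}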
 
no current GPU has special purpose texture caches anymore
NV still uses dedicated streaming cache for the TMUs (separate from the LDS/L1d combo) and in Kepler they actually enhanced it with direct access for the ALUs. On the other hand, AMD kept the GDS memory in GCN, though it's not exposed in any platform API.
 
Nick, in addition to future instruction sets, which specific benefits do you see in embedded RAM (like Iris Pro) for software renderers?
It could be very valuable for implicitly storing the entire color and depth buffer, as well as all the visible texture data for the current scene, without having to resort to deferred tile based rendering. This seems especially useful for AVX-512, which would demand a lot of bandwidth. So it could both offer a speedup and keep things straightforward for developers. And it saves power too!

I consider it a natural and generic extension of the L1/L2/L3 cache hierarchy with a fourth level. There is a large gap between the ~8 MB L3 and ~8 GB RAM. Many working sets are bigger than 8 MB, especially in graphics, but probably also for a lot of other applications.
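As a rough back-of-the-envelope check (assuming a 1080p RGBA8 color buffer and a 32-bit depth buffer; other formats differ): both buffers together take about 16 MB, which would leave well over 100 MB of a 128 MB eDRAM for the visible texture working set.

#include <cstdio>

int main()
{
    const long long width = 1920, height = 1080;
    const long long colorBytes = width * height * 4;   // RGBA8:  ~7.9 MB
    const long long depthBytes = width * height * 4;   // 32-bit: ~7.9 MB
    const long long edram = 128LL << 20;               // 128 MB (the Haswell/Iris Pro L4 discussed below)

    printf("color + depth: %.1f MB\n", (colorBytes + depthBytes) / (1024.0 * 1024.0));
    printf("left for textures: %.1f MB\n", (edram - colorBytes - depthBytes) / (1024.0 * 1024.0));
    return 0;
}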
 
The new Haswell 128 MB L4 cache is definitely a very interesting addition for software rendering. I would expect it to speed up CPU based SVO rendering algorithms a lot (as the data sets and access patterns match quite well). Too bad the L4 is not available for the desktop/workstation processors yet. 8 core Haswell based Xeon with 128 MB L4 would be a perfect test setup for software rendering development.
 
On the other hand, AMD kept the GDS memory in GCN, though it's not exposed in any platform API.
Yes, GDS is still there in GCN (but more generically used by various means) and ROP caches are still separate, but many other fixed function caches/buffers were removed and replaced with general purpose ones. The direction is clear: more general purpose units and structures, and fewer fixed function ones. Gradually GPUs are becoming more and more general purpose.

GCN whitepaper:
http://www.amd.com/us/Documents/GCN_Architecture_whitepaper.pdf

quote from the GCN whitepaper:
"The most significant changes in GCN were unquestionably in the cache hierarchy, which has morphed from a tightly focused
graphics design into a high performance and programmable hierarchy that is both well suited for general purpose applications and ready for integration with
x86 processors. The changes start within the Compute Unit, but extend throughout the entire system.

The GCN memory hierarchy is a unified read/write caching system with virtual memory support and excellent atomic operation performance. This represents a
significant improvement over the separate read caching and write buffering paths used in previous generations. Vector memory instructions support variable
granularity for addresses and data, ranging from 32-bit data to 128-bit pixel quads."
 
I would expect it to speed up CPU based SVO rendering algorithms a lot (as the data sets and access patterns match quite well).
Isn't it just twice the bandwidth of the 4-channel speed we have on Intel's ...-E CPUs?

IMO, it sounds more like a power saver (by allowing cheaper/slower memory) than something that would benefit desktops.


Does anyone actually have data on the transistor count or die size of modern TMUs, rasterizers or other fixed-function units? I couldn't find anything useful yet.
 
Isn't it just twice the bandwidth of the 4-channel speed we have on Intel's ...-E CPUs?

IMO, it sounds more like a power saver (by allowing cheaper/slower memory) than something that would benefit desktops.
Yes, the memory bandwidth of the Xeons/-E models is doubled. That's another nice gain compared to standard consumer models.

L4 cache cuts the latency roughly to half compared to full cache miss (DDR3 memory access) according to Anandtech benchmarks, so it definitely helps improving performance as well (in addition to energy savings).
 
L4 cache cuts the latency roughly to half compared to full cache miss (DDR3 memory access) according to Anandtech benchmarks, so it definitely helps improving performance as well (in addition to energy savings).
Thanks for reminding me of those... http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3 ... it seems to be not even 50 GB/s in benchmarks. I think I've seen some Sandy Bridge-E systems scoring around 40 GB/s, but across the whole memory range.
Half the latency is still nice, though now it looks even more like an energy saver for laptops than a performance boost for the high end.
 
Yeah, for Quake 1 at high resolutions. That's a lot of very simple shading on a small amount of geometry yielding polygons with a huge number of pixels. Not exactly a great case study for software rendering with modern games. There's also no guarantee that the DX10 engine is as sophisticated as the software one.

Given the pixel-to-polygon ratio, his overdraw elimination is very cheap compared to what you gain, so it's not hard to see that it'll win when you're bandwidth limited. And while the results are impressive with bilinear filtering, that's still a far cry from everything a full-featured hardware TMU would do.
I don't think you're fully grasping the magnitude of what is achieved here. Quake pretty much spawned the existence of dedicated 3D graphics hardware for consumers. Now things have nearly come full circle.

The value of having an integrated GPU is quickly disappearing. People who stick with the integrated GPU don't care much about graphics anyway, aside perhaps from running Angry Birds. What would they need fixed-function tessellation hardware for? It's just cheaper to implement that in software when and if it's needed. AVX-512 should even deliver quite adequate performance, and can also be used for a gazillion other high DLP algorithms. Even beyond 3D.

The CPU also allows for a more 'intelligent' approach versus the brute force philosophy of the GPU. Overdraw elimination is just one example. Recompiling shaders with specialization for different usage scenarios is another one. Maxwell is even supposed to get CPU cores on-die, quite possibly in an attempt to bring the same kind of intelligence to the GPU. But from a compute density and power consumption point of view, it just means more convergence.
 
So Nick, are you predicting a future with CPUs and discrete graphics cards?
Hardcore gamers were the first willing to pay for discrete graphics cards, and will be the last to keep buying them when the CPU and GPU start unifying from the low-end market up. But in the very long run I see no future for anything discrete. The writing is on the wall. The new generation of consoles already have the CPU and GPU on the same die. It offers a great deal of new possibilities. These consoles and the games developed for them may even act as a catalyst for adding powerful compute capabilities to desktop CPUs and making discrete GPUs more generic.
 
Note that vertex and pixel processing started out completely differently, not even using a single chip. In comparison the CPU and GPU are much closer already (both physically and in terms of overlap between the kind of processing they can do). So while their unification won't come for free either, I still strongly believe it's going to happen.
I think vertex and pixel unification was free, as it meant chip architects no longer had to over-provision the separate units to account for the cases in which you are completely bottlenecked by one or the other.

- Geometry amplification is possible with geometry shaders and tessellation. DX11 compatible tessellation uses unified programmable shader cores (just like all other shader types). Previous fixed function tessellators never became popular because of limited flexibility. Geometry amplification allows many new algorithms to be implemented, and it allows data to be kept in on-chip shared memory (instead of multipassing, thus reducing lots of data movement = reducing energy usage).
You make many good points so I won't focus on those, but I disagree that previous fixed function tessellators didn't catch on because of limited flexibility. DX11 tessellation doesn't have much more flexibility outside of the hull shader and most uses of DX11 tessellation so far haven't taken advantage of the additional flexibility.

The multi-pass nature of previous methods might have been an issue for some people, but I view that as a potential performance issue not a case of limited flexibility.
 