Software/CPU-based 3D Rendering

Discussion in 'Rendering Technology and APIs' started by 3D_world, Oct 28, 2012.

  1. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    Yeah, for Quake 1 at high resolutions. That's a lot of very simple shading on a small amount of geometry yielding polygons with a huge number of pixels. Not exactly a great case study for software rendering with modern games. There's also no guarantee that the DX10 engine is as sophisticated as the software one.

    Given the pixel to polygon ratio his overdraw elimination is very cheap compared to what you gain, so it's not hard to see that it'll win when you're bandwidth limited. And while the results are impressive w/ bilinear, that's still a far cry from all that a full-featured hardware TMU would do.
     
  2. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    Power (density) concerns already put a limit on how many transistors you can use at a time. In principle you could begin the design process by throwing in as many general purpose cores (initially without any specialised instructions) as fit your power budget, then add specialised instructions or more complex units for any operation you come across which is reasonably common and where specialised hardware can save more power than the cost of moving data to and from that unit.

    Texture fetch is an interesting example because dedicated hardware can potentially reduce data movement across the chip.
     
  3. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    CPU cores are indeed adding more functionality every generation, but they're not getting larger in absolute terms, thanks to process shrinks. Multi-core is an obvious consequence of that. So the distance the data has to travel is really getting smaller, which saves power. Even ultra-mobile cores now use out-of-order execution, which shows that it can be both area and power efficient.

    The layout of a modern CPU core also ensures that the data doesn't have to travel "from one end of the core to another". At least not within one clock cycle. The SIMD units are even split into lanes and it 'costs' an additional cycle for cross-lane operations. Which really is a small price to pay for the power efficiency benefit. And the same things apply to GPUs. A GK104 does not consist of 1536 tiny scalar cores. It consists of 8 large cores with 6 vector FMA units of 32 elements each. But it's quite power efficient because the biggest volume of data, namely the operands, only has to travel a fraction of the distance of the entire core every cycle. Still, NVIDIA concludes that "fetching operands costs more than computing on them". Hence the Echelon research project suggests using a tiny operand register file to keep the most recently used data very close to the ALUs. Which essentially serves the same purpose as a bypass network on a CPU.
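    The lane split can be sketched in scalar code. This is a toy model, not real intrinsics: a 256-bit register is treated as two independent 128-bit lanes, so an in-lane shuffle permutes each half separately, while a cross-lane permute (the operation that 'costs' the extra cycle) can move any element anywhere.

```cpp
#include <array>
#include <cstdint>

// Toy model of a 256-bit SIMD register as two independent 128-bit lanes
// of four 32-bit elements each (the AVX layout described above).
using Lane = std::array<uint32_t, 4>;
struct Reg256 { Lane lo, hi; };

// In-lane shuffle: each lane is permuted independently, so data never
// crosses the lane boundary (cheap, like vpshufd).
Reg256 shuffle_in_lane(const Reg256& r, const std::array<int, 4>& idx) {
    Reg256 out;
    for (int i = 0; i < 4; ++i) {
        out.lo[i] = r.lo[idx[i]];
        out.hi[i] = r.hi[idx[i]];   // same pattern, but only within 'hi'
    }
    return out;
}

// Cross-lane permute: any of the 8 elements can go anywhere, so data must
// travel across the lane boundary (costs an extra cycle, like vpermd).
Reg256 permute_cross_lane(const Reg256& r, const std::array<int, 8>& idx) {
    auto get = [&](int i) { return i < 4 ? r.lo[i] : r.hi[i - 4]; };
    Reg256 out;
    for (int i = 0; i < 4; ++i) {
        out.lo[i] = get(idx[i]);
        out.hi[i] = get(idx[i + 4]);
    }
    return out;
}
```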

    So both the CPU and GPU have to play a balancing act that starts to look increasingly more alike. Other research has also concluded that the difference between "large" and "small" cores is very minor, because you're just looking at it from a different level. With CPUs increasing their SIMD width, they start to look a lot like a GPU with many tiny and efficient 'cores' inside of them. Note that the out-of-order execution logic is not affected by this. It just gets cheaper with every process shrink.
     
  4. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    First of all, that was just an example 'for illustration purposes'. Obviously a fixed-function unit doesn't consume zero power either, so I simplified it for both ends. The important observation is that moving data is increasingly becoming more costly than computing on it. This wasn't the case before, so it's a paradigm shift that is bound to have vast consequences. It means that if you have to move data into a core anyway, you should keep it there as long as possible (which can include a few levels of cache). You can execute several more instructions for the cost of moving data out of the core and back in again, even when taking register accesses into account. And that number will continue to increase.

    Secondly, the article concludes that the cost of the register file access can be significantly reduced by having a tiny operand register file. CPUs get most of their operands from a bypass network, which in theory is even more power efficient so it can be mostly neglected when comparing programmable versus fixed-function logic.
    Has any GPU ever been capable of thread migration (aside from Larrabee)?

    Anyway, while pinning threads to cores benefits data locality, it can complicate balancing. The typical answer to that is to just have more granularity and juggle many threads per core, but that actually worsens data locality. GPUs not only run many threads, which generates an irregular data access pattern, they also have tiny caches. So the cache hit rate is relatively poor, which in turn means that data has to be fetched from a more distant location, which will soon become a dominant factor in power consumption...

    It's not that this is an insurmountable problem, but each solution involves converging closer to a CPU architecture.
     
  5. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Not really. It's a prerequisite, but not a driver. The hardware-related drivers behind it were bandwidth and power consumption. With fixed-function graphics chips, multi-pass techniques were used to create more advanced effects than what single- or dual-texturing could produce. They could have scaled it to a thousand pixel pipelines by now, if it wasn't for the extreme RAM bandwidth this would have required. It's obviously far more efficient to have an on-die register file to store the temporary results of each arithmetic operation, than to write them to RAM and read them back later.

    So in a sense programmability was always there. It just wasn't practical to have many passes. Also note that despite introducing 'real' programmability, Shader Model 1.0 was insanely limited, with just 2 temporary registers and a maximum of 4 texture lookups. So developers still resorted to multi-pass, only to run into bandwidth issues again. Every major advance in the "programmability" of graphics hardware, which is otherwise a pretty vague term in and of itself, has since introduced better control over data locality. Possibly the next step with DirectX 12 is to have reconfigurable shader stages, in another attempt to keep more data on-die instead of streaming it out to RAM and reading it back. To me this looks as primitive as Shader Model 1.0 again, except at the task level instead of the instruction level. Eventually this is bound to evolve into giving the developers complete access to things like task subdivision and scheduling, just like in a software renderer.
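    The bandwidth argument can be made concrete with a toy two-stage 'shader' (stage names and operations hypothetical): the multi-pass version writes its intermediate result to a framebuffer in RAM and reads it back, while the programmable single-pass version keeps the temporary in a register and touches RAM once.

```cpp
#include <cstddef>
#include <vector>

// Multi-pass on fixed-function hardware: pass 1 writes the intermediate
// result out to the framebuffer, pass 2 reads it back -- twice the RAM
// traffic for the same arithmetic.
std::vector<float> render_two_pass(const std::vector<float>& tex,
                                   float light, float glow) {
    std::vector<float> framebuffer(tex.size());
    for (std::size_t i = 0; i < tex.size(); ++i)   // pass 1: write to RAM
        framebuffer[i] = tex[i] * light;
    for (std::size_t i = 0; i < tex.size(); ++i)   // pass 2: read back, write again
        framebuffer[i] = framebuffer[i] + glow;
    return framebuffer;
}

// Programmable shader: the temporary lives in an on-die register, so the
// framebuffer is only written once.
std::vector<float> render_single_pass(const std::vector<float>& tex,
                                      float light, float glow) {
    std::vector<float> framebuffer(tex.size());
    for (std::size_t i = 0; i < tex.size(); ++i) {
        float r0 = tex[i] * light;                 // temporary stays on-die
        framebuffer[i] = r0 + glow;
    }
    return framebuffer;
}
```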
    Please point me to a senior Intel employee stating this.

    Anyway, I think it's critical to note that the end of Moore's law doesn't equal the end of all semiconductor scaling. The 'law' simply states that the transistor count doubles every two years, but if this no longer holds then it might still double every three years, for an extensive period of time. Also, you don't necessarily need more transistors for more performance. Clock speed might go up with advanced materials, using the same minimal feature size. Or power might be reduced, which allows more transistors to be active. Or new design methodologies are developed which offer similar benefits. Stacking of dies is also a way to keep increasing the transistor count, and can bring logic and/or storage closer together to save power. Or we might use optics or something else which changes everything we thought we knew. Researchers have proven several times before that they won't throw in the towel easily just because they're hitting some physical limits. I find it mind-boggling that Intel will still manufacture the 10 nm node using 193 nm lithography. Either way, if the billions of dollars stop being poured into process technology, they will become available to other research. That's a lot of dough to find the next holy grail. There are lots of things left to be explored, and probably a ton of things yet to be discovered. So rest assured that steady progress will continue for a long time to come.

    And as I've said multiple times now, we'll reach the point where CPU-GPU unification makes sense, well before the point where Moore's law no longer holds 100%. Having an integrated GPU adds cost, and if the CPU does an adequate job at graphics then that cost will be eliminated. If AVX-512 comes to Skylake then I don't think that time is much further away.
    Ah, yes, that must also be why we still use fixed-function pixel processing. No, wait...

    Fixed-function is fixed-function. Whenever you want to do something beyond its capabilities, programmable logic quickly wins. This is happening to rasterization too. Researchers are proposing new algorithms every year which require more flexible rasterization. Even NVIDIA considers it an important research field. Also, you're grasping at straws when it comes to the performance impact of this. Rasterization takes only a fraction of computing power compared to what is required for shading. And then there's still the opportunity to add a handful of instructions to speed it up and make it more power efficient.
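    For reference, the core of a software rasterizer really is just a handful of instructions per pixel. This is a minimal half-space (edge function) sketch, ignoring fill rules, clipping and the hierarchical traversal a real renderer would add:

```cpp
#include <utility>
#include <vector>

struct Pt { float x, y; };

// Signed area test: positive when p lies to the left of edge a->b.
static float edge(const Pt& a, const Pt& b, const Pt& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Returns the pixels (by integer coordinates) whose centers fall inside
// triangle v0-v1-v2, scanning the [0,w) x [0,h) raster.
std::vector<std::pair<int, int>> rasterize(Pt v0, Pt v1, Pt v2, int w, int h) {
    std::vector<std::pair<int, int>> covered;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            Pt p{x + 0.5f, y + 0.5f};          // sample at the pixel center
            if (edge(v0, v1, p) >= 0 && edge(v1, v2, p) >= 0 &&
                edge(v2, v0, p) >= 0)
                covered.emplace_back(x, y);
        }
    return covered;
}
```

    A programmable version can swap the per-pixel test for conservative coverage, custom sample patterns, etc., which is exactly the flexibility the new algorithms ask for.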
     
  6. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    It wasn't intended as a realistic example. It just illustrates that data locality is becoming critical, and processing instructions is getting really cheap.
    That would be a great example if not for the fact that GPUs now already mostly consist of programmable logic, which isn't going to diminish. The thing is, graphics is not like video decoding. It is far more diverse. And even that 1080p decoder you're talking about is utterly worthless when presented with a new codec. Instead the CPU can handle any codec, but it will also let you run your O.S., browse the web, play a game with advanced physics and A.I., predict the weather, etc. So the value of something should be determined over the entire range of things it supports. Else fixed-function hardware would always win and the very idea of having a CPU would be insane. But 1-2 Watt is perfectly acceptable for the markets this CPU targets. For the same reasons practically nobody buys a dedicated sound card any more. It costs more than what it offers in added value. GPUs are not very different any more. You could in theory design a much more dedicated chip which supports only the algorithms of one particular game (e.g. with a pipeline for each shader, optimized down to the last bit). But obviously that's not worth it. And besides, it would still just be bandwidth limited. So only some power efficiency advantage would remain, but instead we'd rather get hardware that runs a little hotter but supports many more applications.

    It's only a matter of time before people no longer value an integrated GPU more than what it costs them, and perceive it as a limitation instead.
    Sure, "for a particular task" you can get results like that. But again you have to look at the much bigger picture to evaluate the cost vs. value to see whether we can expect it in consumer hardware any time soon. Also note that some new x86 instructions are dozens of times faster than their previous alternative instruction sequences, but when used in complete applications the speedup is far more modest. Still, it shows that the same benefits of fixed-function hardware can be achieved using new instructions, in an architecture that will also let you run your O.S., browse the web, play a game...
     
  7. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    I'm not denying that at all. I am extremely aware of the cost. But unifying the GPU's vertex and pixel processing also had a cost, and yet it still happened. Note that vertex and pixel processing started out completely differently, not even using a single chip. In comparison the CPU and GPU are much closer already (both physically and in terms of overlap between the kind of processing they can do). So while their unification won't come for free either, I still strongly believe it's going to happen.
    Why? It's starting to become a pretty close race. GPUs can't keep up their historic scaling. They're largely at the mercy of process technology to be able to increase performance. And even then, effective performance won't scale as much as theoretical performance, due to the bandwidth wall. CPUs on the other hand have scaled from 128-bit to 256-bit SIMD, with FMA, and later 512-bit, all with a very minor increase in core transistor count, thus allowing the core count itself to go up as well. The GPU has been pushing the limits of compute density for much longer, and can't scale it as aggressively any more. The CPU's RAM bandwidth can still double with DDR4, which is also something that would be hard to imagine for the GPU (until they move RAM on-package, but that's already available for the CPU).
    The freedom that I am proclaiming stems from the CPU's ability to support any graphics feature, past or future. In contrast it is simply impossible for a Shader Model 4.0 GPU to perform the new Shader Model 5.0 functionality (without involving the CPU, of course). The CPU can do it reasonably efficiently, plus anything else you can imagine. In fact given the computing resources, the performance can be quite exceptional in my opinion. With just 100 GFLOPS worth of computing power, you can compete with a 50 GFLOPS GPU. Most people would expect it to be far worse due to having to 'emulate' things. But with dynamic code generation and specialization the code becomes very streamlined to perform only exactly the operations required. So an advanced software renderer is really a graphics library implementation rather than an emulator. And with CPUs being hardware, and shaders being software, where's even the distinction between hardware and software rendering? It is fading rapidly.
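    Here is a small sketch of what such specialization looks like, using C++ templates as a stand-in for the runtime code generation a software renderer actually performs (the state flags and shading ops are hypothetical): the generic per-pixel branches disappear, and dispatch happens once per draw call instead of once per pixel.

```cpp
// Hypothetical render state: which optional stages are enabled.
struct State { bool textured; bool fogged; };

// One specialized routine per state combination; the disabled operations
// are compiled out entirely, so the pixel loop runs only the math it needs.
template <bool Textured, bool Fogged>
float shade(float base, float texel, float fog) {
    float c = base;
    if constexpr (Textured) c *= texel;              // resolved at compile time
    if constexpr (Fogged)   c = c * (1 - fog) + fog; // blend towards white fog
    return c;
}

// Select the streamlined routine once per draw call.
using ShadeFn = float (*)(float, float, float);
ShadeFn specialize(const State& s) {
    if (s.textured) return s.fogged ? shade<true, true> : shade<true, false>;
    return s.fogged ? shade<false, true> : shade<false, false>;
}
```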

    I don't see what you consider laughable about AVX, unless you're talking about the way AVX1 was implemented on Sandy Bridge. It was severely bottlenecked by L/S bandwidth. But this was rectified with Haswell. It also added FMA support, which used to be pretty GPU exclusive, and 256-bit integer operations. AVX2 didn't get an efficient gather implementation though, but the optimization manual already promises improvements in future hardware (which possibly also supports AVX-512). So really the efficiency of AVX is catching up with the GPU. If you find it laughable now, you won't be laughing for much longer. Both architectures have a few pros and cons, but they mostly average out and there is continued convergence taking place.
    Depending on how you define it, the CPU can definitely do "better graphics" than the GPU. Also, the best performance improvement is the transition from the nonworking state to the working state, which is where software rendering currently has the most commercial success.

    Why do you think it will be much harder in the future?
     
  8. rapso

    Newcomer

    Joined:
    May 6, 2008
    Messages:
    215
    Likes Received:
    28
    Somehow reminds me of RISC vs CISC back then: having simple, generic units/instructions instead of tons of specialized instructions :).
    Nowadays Intel and AMD have CISC op-codes but internally moved to more RISC-like cores, and with more and more specialized instructions (gather, fma, blendps...) it swings towards a more CISC way. While on the other side e.g. ARM created Thumb, which looks somehow more CISC.

    I think the same will continue with CPUs vs GPUs. They want to be power efficient, of course. But they don't want to be religious about "lets keep it all fixed function" or "everything has to be software", as both sides can be power inefficient. Larrabee somehow proved that you can't be that crazy.
    Although I agree that GPU and CPU will fuse together, I think more about those endless amounts of vector units. The CPU will still keep its OoO, the op-code decoder and all the other front end; the GPU will keep the 'fetch' units, the rasterizer.
    It's of course a waste of resources to have a rasterizer that is maybe just utilized to 10% overall, but at the same time it's a waste of time to utilize 10% of the vector units to do rasterization if a simple hard-wired rasterizer would solve it.

    I think we've seen with NVIDIA's TMUs, which seem to have moved the perspective division to the vector units, how it's going to continue. Fixed units will be stripped to the bare minimum that is still 10x better in FF than in software, while moving some stuff to the vector units. In terms of the rasterizer, I could imagine the rough rasterization of pixel blocks stays on the fixed side, while some 'early-z' fetch + fine rasterization might be moved to the vector units that would otherwise maybe stall, waiting for visible pixels. Also the ordering of pixels is probably way more efficiently handled in hardware than in software.

    I think it's power-inefficient to remove fixed functions and at the same time add special instructions. Even if those instructions are just supportive and would be way smaller than the fixed unit, they would be added to every vector unit, wasting transistors by not being used most of the time.

    Also, fixed-function units are not just the bare algorithm put into transistors; on GPUs they usually have nifty tricks to hide latency and to queue and balance workloads, e.g. compression for MSAA and splitting subsamples into individual memory layers, while doing gamma-correct blending.
     
  9. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    I don't see the claimed movement towards fixed function hardware. In fact quite the opposite seems to be true, when you compare DX9 hardware to DX11 hardware. Programmability has increased dramatically and many fixed function units have been replaced by more general purpose units.

    - We now have unified shaders instead of separate vertex and pixel shaders.
    - Registers and ALU (math) are now IEEE-compatible 32-bit floats (special-cased 16/24-bit float processing is a thing of the past). ALU is cheap compared to bandwidth.
    - Custom texture caches and constant caches have been replaced with general purpose L1 caches (backed up with general purpose L2 caches).
    - Lots of custom internal data buffers have been replaced with a single internal general purpose shared memory ("LDS"). As data caches/buffers tend to be quite big (lots of transistors), it's a waste to have a separate custom one for each fixed function unit (because things like geometry shaders and tessellation are not always active, and not all vertex shaders output maximum amount of interpolants, etc).
    - All the listed improvements have allowed a cost-effective way to introduce new compute shader functionality. The general purpose on-chip shared memory allows efficient data sharing and synchronization between multiple threads of the same compute unit (data can be kept close to the execution units to reduce latency and to reduce memory traffic = reduce power usage). Unified shader cores have flexible memory input/output paths (previously vertex shaders were scatter only, and pixel shaders were gather only), filling the other requirement for general purpose processing using shader cores.
    - Geometry amplification is possible with geometry shaders and tessellation. DX11-compatible tessellation uses unified programmable shader cores (just like all other shader types). Previous fixed-function tessellators never became popular because of limited flexibility. Geometry amplification allows many new algorithms to be implemented, and it allows data to be kept in on-chip shared memory (instead of multipassing, thus reducing lots of data movement = reducing energy usage).
    - GPUs can now spawn threads and control draw call primitive counts themselves (by indirect draw/dispatch calls). This creates lots of new possibilities for programmers. Kepler also can spawn new kernels from GPU (without CPU assistance).

    We have also seen failures of fixed function hardware. Free 4xMSAA required lots of transistors on Xbox 360 (and lots of internal EDRAM bandwidth). A few years after the console launch deferred rendering was invented, and pretty much nobody uses MSAA hardware anymore in their console games. These transistors are just idling there doing nothing... That's always the risk of fixed function hardware. If it doesn't suit the task, it will be just dead silicon.
    Agreed, fixed function texture filtering reduces traffic from L1 cache to registers. However it doesn't reduce traffic from memory to L2 to L1 (as all texels need to be in L1 for filtering), and that's the biggest distance the data needs to move (and thus consumes most of the data traffic energy).

    For compressed (or 8888/11f-11f-10f) data, the savings in register bandwidth are not always that clear. After filtering and sRGB conversion, the fixed function unit must send a 4x32f value through the internal link (as one texel might be 234 and the next one 235, and it might be zoomed so that one pixel covers the whole screen, and we still need a smooth gradient = we need lots of bits of precision). In this case the fixed function unit loses the point sampled 8888 case by 4x, and ties the 8888 bilinear filtered case. The fixed function unit wins the trilinear case by 2x (and the anisotropic case by a larger margin). The BC7 compressed case is harder to analyze. It favors the fixed function hardware, except in cases where the CPU implementation is allowed to branch (a 3x3 area of each 4x4 block needs just one 128 bit register load). However CPU performance completely craps out if you add those incoherent branches to the code, so this task should favor fixed function hardware if we only consider the L1->register data movement (and even more so for trilinear and anisotropic filtering).
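    For reference, this is what the software side of that comparison looks like: a minimal scalar bilinear sample over a single-channel texture (clamp addressing assumed), costing four texel loads and three lerps per output value, i.e. the L1-to-register traffic being weighed against a TMU:

```cpp
#include <algorithm>
#include <vector>

// Bilinear sample of a w x h single-channel texture, u/v in [0,1],
// clamp addressing, texel centers at integer coordinates.
float bilinear(const std::vector<float>& tex, int w, int h, float u, float v) {
    float x = u * (w - 1), y = v * (h - 1);
    int x0 = (int)x, y0 = (int)y;
    int x1 = std::min(x0 + 1, w - 1), y1 = std::min(y0 + 1, h - 1);
    float fx = x - x0, fy = y - y0;
    float t00 = tex[y0 * w + x0], t10 = tex[y0 * w + x1];   // 4 texel loads
    float t01 = tex[y1 * w + x0], t11 = tex[y1 * w + x1];
    float top = t00 + (t10 - t00) * fx;                     // 3 lerps
    float bot = t01 + (t11 - t01) * fx;
    return top + (bot - top) * fy;
}
```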

    However the fixed function unit too needs to load its texels from the same general purpose L1 cache (no current GPU has special purpose texture caches anymore). In order to have any gains in data movement energy efficiency, the fixed function unit needs to be closer to the L1 cache than the general purpose register files. If the L1 is large, this might be problematic (unless you bank it, and replicate the fixed function units for each bank... adding even more dark silicon).

    The anisotropic filtering case is quite hard for a CPU to solve efficiently. It must adapt the texel count (and access pattern) rapidly based on the surface slope. But branch mispredicts are in general around 20 cycles, and these kinds of scenarios are very hard to predict properly. The new Haswell gather instruction has a read mask (to skip some lane loads without branching), but I don't know how efficient it is compared to standard register loads.
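    The read-mask semantics can be sketched in scalar code (a toy model of a vpgatherdps-style instruction, not Haswell's actual implementation): lanes with a cleared mask bit skip their load and keep the destination's previous contents, so varying the texel count needs no branch.

```cpp
#include <array>
#include <vector>

// Toy 4-lane masked gather: each lane has an index and a mask bit.
// Masked-off lanes skip the load and keep the old destination value,
// which is how a variable number of texels can be fetched branch-free.
std::array<float, 4> masked_gather(std::array<float, 4> dst,
                                   const std::vector<float>& mem,
                                   const std::array<int, 4>& idx,
                                   const std::array<bool, 4>& mask) {
    for (int lane = 0; lane < 4; ++lane)
        if (mask[lane])            // per-lane predication, not a real branch in HW
            dst[lane] = mem[idx[lane]];
    return dst;
}
```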
     
  10. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    NV still uses dedicated streaming cache for the TMUs (separate from the LDS/L1d combo) and in Kepler they actually enhanced it with direct access for the ALUs. On the other hand, AMD kept the GDS memory in GCN, though it's not exposed in any platform API.
     
  11. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    It could be very valuable for implicitly storing the entire color and depth buffer, as well as all the visible texture data for the current scene, without having to resort to deferred tile based rendering. This seems especially useful for AVX-512, which would demand a lot of bandwidth. So it could both offer a speedup and keep things straightforward for developers. And it saves power too!

    I consider it a natural and generic extension of the L1/L2/L3 cache hierarchy with a fourth level. There is a large gap between the ~8 MB L3 and ~8 GB RAM. Many working sets are bigger than 8 MB, especially in graphics, but probably also for a lot of other applications.
     
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    The new Haswell 128 MB L4 cache is definitely a very interesting addition for software rendering. I would expect it to speed up CPU based SVO rendering algorithms a lot (as the data sets and access patterns match quite well). Too bad the L4 is not available for the desktop/workstation processors yet. 8 core Haswell based Xeon with 128 MB L4 would be a perfect test setup for software rendering development.
     
  13. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Yes, GDS is still there in GCN (but more generically used by various means) and ROP caches are still separate, but many other fixed function caches/buffers were removed and replaced with general purpose ones. The direction is clear, more general purpose units and structures, and less fixed function ones. Gradually GPUs are becoming more and more general purpose.

    GCN whitepaper:
    http://www.amd.com/us/Documents/GCN_Architecture_whitepaper.pdf

    quote from the GCN whitepaper:
    "The most significant changes in GCN were unquestionably in the cache hierarchy, which has morphed from a tightly focused
    graphics design into a high performance and programmable hierarchy that is both well suited for general purpose applications and ready for integration with
    x86 processors. The changes start within the Compute Unit, but extend throughout the entire system.

    The GCN memory hierarchy is a unified read/write caching system with virtual memory support and excellent atomic operation performance. This represents a
    significant improvement over the separate read caching and write buffering paths used in previous generations. Vector memory instructions support variable
    granularity for addresses and data, ranging from 32-bit data to 128-bit pixel quads."
     
  14. rapso

    Newcomer

    Joined:
    May 6, 2008
    Messages:
    215
    Likes Received:
    28
    Isn't it just twice the bandwidth of the 4-channel speed we have on Intel's ...-E CPUs?

    IMO, it sounds more like a power saver (by allowing cheaper/slower memory) than something that would benefit desktops.


    Does anyone actually have data on the transistor count or die size of modern TMUs, rasterizers or other fixed-function units? I couldn't find anything useful yet.
     
  15. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Yes, the memory bandwidth of the Xeons/-E models is doubled. That's another nice gain compared to standard consumer models.

    The L4 cache cuts the latency roughly in half compared to a full cache miss (DDR3 memory access) according to Anandtech benchmarks, so it definitely helps improve performance as well (in addition to energy savings).
     
  16. rapso

    Newcomer

    Joined:
    May 6, 2008
    Messages:
    215
    Likes Received:
    28
    Thanks for reminding me of those... http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3 ... it seems to be not even 50 GB/s in benchmarks. I think I've seen some of the Sandy Bridge-E chips scoring around 40 GB/s, but over the whole memory range.
    Half the latency is still nice, though now it looks even more like an energy saver for laptops than a performance boost on the high end.
     
  17. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    I don't think you're fully grasping the magnitude of what is achieved here. Quake pretty much spawned the existence of dedicated 3D graphics hardware for consumers. Now things have nearly come full circle.

    The value of having an integrated GPU is quickly disappearing. People who stick with the integrated GPU don't care much about graphics anyway, aside perhaps from running Angry Birds. What would they need fixed-function tessellation hardware for? It's just cheaper to implement that in software when and if it's needed. AVX-512 should even deliver quite adequate performance, and can also be used for a gazillion other high-DLP algorithms. Even beyond 3D.

    The CPU also allows for a more 'intelligent' approach versus the brute force philosophy of the GPU. Overdraw elimination is just one example. Recompiling shaders with specialization for different usage scenarios is another one. Maxwell is even supposed to get CPU cores on-die, quite possibly in an attempt to bring the same kind of intelligence to the GPU. But from a compute density and power consumption point of view, it just means more convergence.
     
  18. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    17,879
    Likes Received:
    5,330
    So Nick, are you predicting a future with CPUs and discrete gfx cards?
     
  19. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Hardcore gamers were the first willing to pay for discrete graphics cards, and will be the last to keep buying them when the CPU and GPU start unifying from the low-end market up. But in the very long run I see no future for anything discrete. The writing is on the wall. The new generation of consoles already have the CPU and GPU on the same die. It offers a great deal of new possibilities. These consoles and the games developed for them may even act as a catalyst for adding powerful compute capabilities to desktop CPUs and making discrete GPUs more generic.
     
  20. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    I think vertex and pixel unification was free as it meant chip architects no longer had to over-provision the separate units to account for the cases in which you are completely bottlenecked by one or the other.

    You make many good points so I won't focus on those, but I disagree that previous fixed function tessellators didn't catch on because of limited flexibility. DX11 tessellation doesn't have much more flexibility outside of the hull shader and most uses of DX11 tessellation so far haven't taken advantage of the additional flexibility.

    The multi-pass nature of previous methods might have been an issue for some people, but I view that as a potential performance issue not a case of limited flexibility.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.