Your wishlist for DX12

I personally believe that triangles will be one of the reasons why we are leaving rasterization behind (if that happens in the future). Subpixel-sized triangles are a huge waste (of processing power, bandwidth and memory). Pure triangles are very inefficient at modeling high detail meshes (high detail CAD geometry can be several gigabytes in size, and that's just for a single building or airplane). With other methods you can have the same level of detail with a much smaller memory footprint (and smaller bandwidth requirements). For example, patches with displacement maps are considerably better, but are a much harder problem for ray intersection calculation. Voxels on the other hand are a very efficient target for ray casting (while being somewhere in the middle of the two in storage requirements for super high detail models, depending of course on the scene).
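
To make the "efficient target for ray casting" point concrete, here is a minimal C++ sketch of the classic Amanatides & Woo 3D DDA traversal over a dense voxel grid. It is purely illustrative (the names are mine, it assumes the ray origin is inside the grid, and a real engine would ray cast a sparse octree or brick structure rather than a dense array), but it shows why the per-step work is so cheap:

Code:
// Minimal 3D DDA (Amanatides & Woo) over a dense voxel grid. Illustrative only:
// real engines ray cast sparse structures (octrees, brick maps) instead.
#include <cmath>
#include <cstdint>
#include <vector>

struct Grid {
    int nx, ny, nz;
    std::vector<uint8_t> occupancy;                     // 1 = solid voxel
    bool solid(int x, int y, int z) const {
        return occupancy[(z * ny + y) * nx + x] != 0;
    }
};

// Casts a ray (origin o, direction d, in voxel-space units) through the grid.
// Assumes the origin is inside the grid; returns true and the hit voxel if any.
bool castRay(const Grid& g, const float o[3], const float d[3], int hit[3])
{
    int   v[3] = { (int)std::floor(o[0]), (int)std::floor(o[1]), (int)std::floor(o[2]) };
    const int n[3] = { g.nx, g.ny, g.nz };
    int   step[3];
    float tMax[3], tDelta[3];

    for (int i = 0; i < 3; ++i) {
        if (d[i] > 0.0f)      { step[i] = 1;  tDelta[i] =  1.0f / d[i]; tMax[i] = (v[i] + 1 - o[i]) / d[i]; }
        else if (d[i] < 0.0f) { step[i] = -1; tDelta[i] = -1.0f / d[i]; tMax[i] = (v[i] - o[i]) / d[i]; }
        else                  { step[i] = 0;  tDelta[i] = 1e30f;        tMax[i] = 1e30f; }
    }

    while (v[0] >= 0 && v[0] < n[0] && v[1] >= 0 && v[1] < n[1] && v[2] >= 0 && v[2] < n[2]) {
        if (g.solid(v[0], v[1], v[2])) { hit[0] = v[0]; hit[1] = v[1]; hit[2] = v[2]; return true; }
        // Step along the axis whose next voxel boundary is closest: one compare,
        // one add, one increment per step, and no triangle intersection math at all.
        int axis = (tMax[0] < tMax[1]) ? ((tMax[0] < tMax[2]) ? 0 : 2)
                                       : ((tMax[1] < tMax[2]) ? 1 : 2);
        v[axis]    += step[axis];
        tMax[axis] += tDelta[axis];
    }
    return false;                                       // left the grid without a hit
}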

If you were doing voxels in the future, you wouldn't use fixed function hardware, would you? Maybe voxel support could be part of the next D3D specification?
 
I believe his stance has evolved somewhat in the time since 2008. (Not necessarily towards tessellation, but some differences with regards to where the alternatives can be applied.)
The topic came up again sometime around the release of Rage, and he seemed to have a few other ideas he wanted to try out.
I haven't tracked down which interview that was.
 

Carmack's opinion has definitely changed on tessellation:

He is first asked about current hardware tessellation at 21m49s:

http://youtu.be/hapCuhAs1nA?t=21m49s

And at 24m53s comes his surprising conclusion on the topic:

http://www.youtube.com/watch?v=hapCuhAs1nA&t=24m53s

"The smart money bet for a next generation console (...) when people build technologies from scratch, the smart money would be on a tessellation based one, down to the micropolygon level, with outside bets in voxel raytracing hybrid type engines."
 
I'd love a 20MP 30" monitor, but it just isn't going to happen this side of 2020.

8K TVs are already planned for before then, so it will probably happen too.

As for voxels/ray tracing for any kind of primary (geometry) purpose/shading, that reminds me somewhat of fusion. It seems great in hypothetical scenarios, and then when you get right down to working with it practically, it just falls apart.

I'm sure we'll have unlimited detail and realtime raytracing someday, just like we'll be running our PCs off our neighborhood fusion plants someday. But for the foreseeable future there's just too much overhead, especially for things like storage of voxels as art assets. Not to mention other people want all that compute power for simulation stuff. Frankly I'd rather see this in a game: http://www.youtube.com/watch?v=47riga-OVro than spend all those resources just so I could use raytracing.
 

I could see voxels being used in plenty of deferred-rendered games for particles, special weird-looking and very detailed objects, volumetric clouds and dust, and other things of that type. But most of the environment, with its long flat concrete walls, and the skinned animated characters will probably stick to polygons for a long time.
 

Oh sure, I'd love to see it for particles; the current billboards (even with all their hacks) just don't cut it. And both CryEngine 3 and Unreal Engine 4 use a voxel data structure for global illumination, and UE4 even uses cone tracing for GI.
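
For anyone who hasn't looked at the technique, the core accumulation loop of voxel cone tracing is quite small. The C++ below is a heavily simplified sketch under my own assumptions: sampleVoxelMip is a placeholder for a trilinear fetch from the voxelized scene's 3D mip chain/clipmap, the constants are arbitrary, and in a real engine all of this runs in a shader.

Code:
// Heavily simplified voxel cone tracing accumulation loop (C++ for readability;
// a real implementation is a shader sampling a 3D mip chain or clipmap).
#include <algorithm>

struct Vec3 { float x, y, z; };
struct Vec4 { float r, g, b, a; };                       // rgb = radiance, a = occlusion

// Placeholder: stands in for a trilinear fetch from the voxelized scene, with the
// mip level chosen so one voxel roughly matches 'diameter'. Dummy data here.
Vec4 sampleVoxelMip(const Vec3& /*pos*/, float /*diameter*/)
{
    return Vec4{ 0.0f, 0.0f, 0.0f, 0.05f };
}

// March one cone from 'origin' along 'dir', compositing front-to-back until the
// cone is (nearly) fully occluded or reaches maxDist.
Vec4 traceCone(Vec3 origin, Vec3 dir, float apertureTan, float maxDist)
{
    Vec4  acc  = { 0.0f, 0.0f, 0.0f, 0.0f };
    float dist = 0.1f;                                   // offset to escape the origin voxel
    while (dist < maxDist && acc.a < 0.95f) {
        float diameter = std::max(0.01f, 2.0f * apertureTan * dist);
        Vec3  pos = { origin.x + dir.x * dist, origin.y + dir.y * dist, origin.z + dir.z * dist };
        Vec4  s   = sampleVoxelMip(pos, diameter);
        float w   = 1.0f - acc.a;                        // closer samples occlude farther ones
        acc.r += w * s.r;  acc.g += w * s.g;  acc.b += w * s.b;  acc.a += w * s.a;
        dist  += diameter * 0.5f;                        // step grows with the cone width
    }
    return acc;
}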

I was more referring to the discussion moving towards those being used as the primary representation, as opposed to our classic vertex/pixel/what-have-you programs and polygons.
 
AFAIK, the tessellator in xbox 360 is mostly unused. Would you say that it has disadvantaged non-tessellated geometry rendering?
Yes, of course. If they had chosen not to include the tessellator, those transistors could have been spent on other features (such as one extra shader ALU), or just removed (to improve yields and manufacturing costs, and thus reduce the console price). Simple as that. This is the exact reason why I feel that adding a large amount of fixed function transistors for features that are not yet established (such as real time raytracing) is not a good decision. It will be a hit or a miss. A big gamble really.

If a huge amount of transistors is spent on some rendering technique that's not yet proven to be usable, there's always the risk of not giving developers exactly what they want. There will be limitations that nobody thought about, because the rendering technique wasn't yet widely used and thoroughly researched. A fixed function hardware solution is also likely to result in less research into alternative techniques, and that's not a good thing if one of those alternatives proves to be better in the long run.

Basically you should only use fixed function hardware for established tasks: tasks that are thoroughly researched, where a "best" solution to the problem exists and is agreed on by most developers, and that are commonly used in the huge majority of games. Good examples: texture samplers, triangle setup, CPU caches, math (sqrt, exp, estimates...), texture decompression, depth buffering (including hierarchical depth and depth compression), general purpose compression/decompression (LZ-style), video decompression (for the parts that benefit most from FF hardware), audio mixing...
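
As an aside on the "math (sqrt, exp, estimates...)" item: that is exactly the kind of tiny fixed function block that earns its keep. For example, SSE exposes a hardware reciprocal square root estimate that software then refines with one Newton-Raphson step. A rough C++ illustration (my example, not from the post above):

Code:
#include <xmmintrin.h>                       // SSE intrinsics

// Hardware reciprocal square root estimate (roughly 12 bits), refined by one
// Newton-Raphson step: y' = y * (1.5 - 0.5 * x * y * y).
float fastInvSqrt(float x)
{
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    return y * (1.5f - 0.5f * x * y * y);
}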

---

I purposefully left MSAA out of the above list. Let me explain it a bit.

Efficient 4xMSAA means that you need:
1. 4x fixed function point-inside-triangle hardware
2. 4x fixed function depth testing hardware
3. 4x fixed function stencil testing hardware
4. 4x fixed function blending hardware
5. 4x wider data path from rops to memory (but not any wider from PS to rops)
6. More memory and more memory bandwidth (for backbuffer operations) and/or MSAA color compression and depth compression (again more fixed function hardware)
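
For reference, on the API side all of that hardware hangs off a single field. Below is a minimal D3D11 sketch (my own illustration, error handling omitted) of allocating a 4xMSAA color target; SampleDesc.Count is what lights up the six items listed above:

Code:
#include <d3d11.h>

// Query support and create a 4xMSAA color target (hypothetical helper).
ID3D11Texture2D* createMsaa4xColorTarget(ID3D11Device* device, UINT width, UINT height)
{
    UINT quality = 0;
    device->CheckMultisampleQualityLevels(DXGI_FORMAT_R8G8B8A8_UNORM, 4, &quality);
    if (quality == 0)
        return nullptr;                                  // 4xMSAA not supported for this format

    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width              = width;
    desc.Height             = height;
    desc.MipLevels          = 1;
    desc.ArraySize          = 1;
    desc.Format             = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.SampleDesc.Count   = 4;                         // the one field that engages it all
    desc.SampleDesc.Quality = quality - 1;
    desc.Usage              = D3D11_USAGE_DEFAULT;
    desc.BindFlags          = D3D11_BIND_RENDER_TARGET;

    ID3D11Texture2D* tex = nullptr;
    device->CreateTexture2D(&desc, nullptr, &tex);
    return tex;
}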

That's a lot of hardware. And MSAA is only used by a small minority of current generation console games (and not in all PC games either). Even in those games that use MSAA, the MSAA hardware is only used for around 20% of the time (rendering objects to g-buffers). It does nothing for the other 80% (it's not needed in shadow map rendering, lighting or post processing).

Current PC GPUs solve the MSAA backbuffer bandwidth demands with depth and color compression hardware (more fixed function hardware), and the rest is solved by making the rops (much) fatter. Xbox 360 solved the problem with EDRAM. It had the required bandwidth and built-in fixed function processors for blending/z-buffering/stenciling. EDRAM solved all six requirements.

Now let me talk about shadow map rendering. It's an operation that is heavily fill rate bound. In games that have lots of dynamic shadow casting lights, shadow map rendering can take up to 30% of the frame time. Shadow map rendering mainly requires fixed function depth buffering hardware (and storing depth samples to the render target); these are its main bottlenecks. GPUs with 4xMSAA have four times the fixed function depth buffering hardware (and the bus/BW/compression to store the depth values). That must be a very good thing for shadow mapping, right? Will it provide 4x depth fill rate?

Unfortunately no... You can render the shadow map to a 4xMSAA buffer, get proper separate depth testing for each subsample, and get 4x fill rate. However, the fixed function point-inside-triangle testers have fixed offsets (an optimized MSAA pattern), and these offsets do not form a regular grid, so sampling from the rendered shadow map texture is not straightforward. Even the simplest nearest neighbor sampling requires an excessive amount of math (dozens of instructions). Because one of the six fixed function components is not up to the task (1), all the others (2-6) are pretty much unusable for shadow mapping as well. What a shame.
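
To illustrate the "dozens of instructions" point, here is roughly the kind of work a nearest-sample lookup into a rotated-grid 4xMSAA shadow map ends up doing. This is my own hypothetical sketch in C++ for clarity (in practice it would be shader code), using the commonly documented standard D3D 4x sample positions:

Code:
#include <cmath>

// Standard D3D 4x MSAA sample offsets relative to the pixel center, in pixels
// (a rotated grid, not a regular 2x2 grid).
static const float kMsaa4x[4][2] = {
    { -2.0f / 16.0f, -6.0f / 16.0f },
    {  6.0f / 16.0f, -2.0f / 16.0f },
    { -6.0f / 16.0f,  2.0f / 16.0f },
    {  2.0f / 16.0f,  6.0f / 16.0f },
};

// Hypothetical nearest-sample fetch from an MSAA shadow map laid out as
// width * height * 4 depth values. With a regular grid the subsample index would
// fall straight out of the fractional coordinates; with the rotated pattern every
// lookup has to search all four offsets.
float fetchShadowDepth(const float* depth, int width, float u, float v)   // u, v in texels
{
    int   px = (int)std::floor(u);
    int   py = (int)std::floor(v);
    float fx = u - px - 0.5f;                            // position relative to pixel center
    float fy = v - py - 0.5f;

    int   best = 0;
    float bestDist = 1e30f;
    for (int s = 0; s < 4; ++s) {
        float dx = fx - kMsaa4x[s][0];
        float dy = fy - kMsaa4x[s][1];
        float d  = dx * dx + dy * dy;
        if (d < bestDist) { bestDist = d; best = s; }
    }
    return depth[(py * width + px) * 4 + best];
}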

The lesson here is: even if some rendering technique is currently very popular (MSAA antialiasing), it might suddenly become a niche because of new research (deferred shading). MSAA depth testing hardware would be very good for shadow map rendering (4x fill rate), but because of a single small mistake in the FF hardware design, it cannot be used for shadow map rendering at all (two sets of MSAA sample offsets and a flag to select the preferred one would be enough to solve the issue). ---> Fixed function hardware is very fragile. Even tiny limitations in it might make it unusable.

All this extra fixed function hardware for 4xMSAA makes rops very fat. Current generation consoles have only eight rops, and thus are often fill rate bound. I would trade all that MSAA hardware for 16 rops any day (and even that would halve the fixed function hardware listed). EDRAM has the bandwidth, that's not a problem.
When power is the major constraining factor, why do we care about utilisation? FF will be "king" so long as it allows continued progress while offering ultimately superior performance. If utilisation were the key metric, VLIW never would have happened. Being able to do whatever you want however you want has a price too; Larrabee wasn't able to pay the price of admission :LOL:.
VLIW happened. Separate vertex and pixel shaders also happened...

VLIW processors such as Intel's Itanium and Transmeta's Crusoe are dead. Itanium was always plagued by heat problems. VLIW GPUs are going the same route: NVIDIA dropped both VLIW and separate pixel and vertex shaders in the GeForce 8000 series, and we all know how that story ended (the 8800 is one of the most popular pieces of graphics hardware ever). AMD kept VLIW alive for longer, but dropped it in GCN. So VLIW is as good as dead (in both the CPU and GPU markets).

Utilization is important. The Pentium 4 pipeline was long to allow very high clocks, and it's very hard to keep a long pipeline properly utilized. Intel's Core 2 architecture went with a much shorter pipeline that allowed much higher hardware utilization. The P4 had much higher theoretical peak numbers, but the improved utilization was the key to both good performance and good efficiency. Thus Core 2 was the biggest jump in perf/watt we have ever seen in CPUs.

---

Also you have to think about chip manufacturing costs. Each fixed function unit requires some extra transistors, and these transistors have a very low utilization ratio (they are idle most of the time, even in software that uses them). These units also require a separate design (research cost), while you can often just replicate the programmable units many times to scale up performance, and slightly improve the design from one generation to the next. Replication is also a boon for programmable units because of yields. You can just add a few extra programmable units to improve yields (and thus lower production costs). A programmable unit can replace any programmable unit if there is a defect in the chip manufacturing. For fixed function units, you need to have a separate spare for each fixed function unit type (a broken video decoder cannot be replaced by anything else, so you might need to have two of them in the chip just to improve yields). Additional broken programmable units often only degrade performance, while the feature set stays the same. This provides a cost effective way to sell hardware at lower price points.
 
The thing that annoys me most in fixed function hardware is that it limits my freedom. Fixed function hardware (and the ratio of different FF hardware) is designed for the needs of an average game (usually a last generation game, since hardware development takes time), and if your game is anything different, the hardware doesn't suit your needs that well. Games are different, and have different hardware requirements. Programmable hardware allows game developers to implement the game they want, not something that the hardware manufacturer has planned for them. Creative freedom is important for good quality games. We need more different games, not hardware limitations that force us to do things in a certain way.

I have been a graphics programmer since the 3dfx era (Voodoo, Riva 128, TNT, etc. are familiar to me). Ever since fixed function T&L came along, graphics hardware has had major limitations compared to software graphics engines. You just had to cope with the limitations and invent creative hacks to get the required graphics features implemented. The time period when half of the cards supported EMBM (environment mapped bump mapping) and half of the cards supported DOT3 bump mapping is something I really want to forget. Luckily DX8 was released, and we finally got shaders that allowed us to implement either one, and much, much more. I have seen the golden age of fixed function graphics programming, and I don't want to go back there. Most of what graphics programmers were doing back then was hacking their way around fixed function hardware limitations.
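
For younger readers: DOT3 bump mapping, which half of the cards back then did in a fixed function texture combiner, boils down to a per-pixel dot product once you have shaders. A minimal C++ illustration of the math (my own sketch, nothing card-specific):

Code:
#include <algorithm>
#include <cstdint>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// DOT3 bump mapping: unpack a tangent-space normal from an RGB normal map texel
// and dot it with the tangent-space light direction. On DX7-class hardware this
// was a fixed function combiner mode; with shaders it is math you just write.
float dot3Lighting(uint8_t r, uint8_t g, uint8_t b, const Vec3& lightTS)
{
    Vec3 n = { r / 127.5f - 1.0f,                        // unpack [0,255] -> [-1,1]
               g / 127.5f - 1.0f,
               b / 127.5f - 1.0f };
    return std::max(0.0f, dot(n, lightTS));              // saturate, like the hardware did
}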

The way tessellation was implemented in the DX11 API isn't that bad really. Domain shaders and hull shaders are written in the same language as all the other shaders and utilize the same programmable hardware. Only the part of the system that spawns new GPU work items is fixed function. We don't yet have a programmable way to spawn work items on the GPU, so this is understandable. But (in the future) when we get a proper way to do that, there might no longer be a need for separate hull and domain shaders. Of course, that is also what I want from the future.
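
Concretely, the split between programmable and fixed function parts is visible in how the DX11 tessellation pipeline is bound. A minimal sketch (a hypothetical helper, shader objects assumed to be compiled elsewhere); note that the tessellator stage between the hull and domain shaders has no shader object to set at all:

Code:
#include <d3d11.h>

// Binding the DX11 tessellation pipeline. Hull and domain shaders run on the
// ordinary programmable ALUs; the fixed function tessellator between them
// spawns the new work items and has nothing to bind here.
void bindTessellationPipeline(ID3D11DeviceContext* ctx,
                              ID3D11VertexShader*  vs,
                              ID3D11HullShader*    hs,
                              ID3D11DomainShader*  ds,
                              ID3D11PixelShader*   ps)
{
    ctx->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_3_CONTROL_POINT_PATCHLIST);
    ctx->VSSetShader(vs, nullptr, 0);
    ctx->HSSetShader(hs, nullptr, 0);   // programmable: per-patch constants and control points
    ctx->DSSetShader(ds, nullptr, 0);   // programmable: evaluates the surface at each generated vertex
    ctx->PSSetShader(ps, nullptr, 0);
}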

Fixed function hardware bottlenecks are a real thing. Thinking back to our last console project, we were bottlenecked by triangle setup, texture sampling, vertex fetch, ALU, fill rate (rops), texture caches and BW :). All in different stages of our scene rendering. Shadow map rendering for example was rop bound for near objects and triangle setup or vertex fetch bound for far away objects (depending on object type). Object rendering was texture sampling bound for some objects, texture cache bound for some (reflective), vertex fetch bound for some (high polygon, far away) and ALU bound for some (very near objects with heavy shaders). Deferred lighting was ALU bound, and post processing was texture sampling bound. Most of the time during the frame we were bottlenecked by some fixed function unit. The ALUs sat there doing nothing during that time, and we were still heavily ALU bound during lighting. That's what happens when you have too many different bottlenecks (different fixed function units). We of course did some creative hacks, for example to reduce vertex fetching bottlenecks (compress vertices and burn some ALU to decompress) and to reduce texture fetching / BW bottlenecks (encoding/decoding), but optimizations like this were only a win in some cases, so the number of shader combinations increased, and debugging and profiling time increased drastically. This is what you have to do in order to cope with the GPU fixed function bottlenecks. It takes lots of programming resources = makes games cost more to develop.
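
As an example of the "compress vertices and burn some ALU to decompress" trick mentioned above, here is roughly what position quantization looks like. A simplified C++ sketch under my own assumptions; on the GPU the decode half would live in the vertex shader, with the bounding box supplied as a per-object constant:

Code:
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Vec3      { float x, y, z; };
struct PackedPos { uint16_t x, y, z; };                  // 6 bytes instead of 12

// Quantize a position inside a known object-space bounding box to 3x16 bits.
PackedPos packPosition(const Vec3& p, const Vec3& boxMin, const Vec3& boxMax)
{
    auto q = [](float v, float lo, float hi) {
        float t = (v - lo) / (hi - lo);                  // normalize to [0,1]
        t = std::min(1.0f, std::max(0.0f, t));
        return (uint16_t)std::lround(t * 65535.0f);
    };
    return { q(p.x, boxMin.x, boxMax.x), q(p.y, boxMin.y, boxMax.y), q(p.z, boxMin.z, boxMax.z) };
}

// Decode: the part that would run in the vertex shader, spending a few ALU ops
// per vertex to save fetch bandwidth.
Vec3 unpackPosition(const PackedPos& p, const Vec3& boxMin, const Vec3& boxMax)
{
    auto d = [](uint16_t v, float lo, float hi) { return lo + (v / 65535.0f) * (hi - lo); };
    return { d(p.x, boxMin.x, boxMax.x), d(p.y, boxMin.y, boxMax.y), d(p.z, boxMin.z, boxMax.z) };
}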
 
Yes, of course. If they had chosen not to include the tessellator, those transistors could have been spent on other features (such as one extra shader ALU), or just removed (to improve yields and manufacturing costs, and thus reduce the console price). Simple as that.
While every fixed function feature is going to be different, removing the Xbox 360's tessellator wouldn't have allowed for another ALU or reduced the console price. I doubt Microsoft can even measure the yield reduction without going to a lot of decimal places.

Your post is correct in many areas, but I wanted to note that if a fixed function feature requires little enough hardware, its production cost can effectively be hidden. Sections of the chip are synthesized and laid out separately, and depending on how easy a section is to route and time, utilization can be increased without negatively affecting area.

My main point is that every extra gate in a chip isn't felt, but changes that are large enough to contribute percentage points to a section will start to affect area.
 
While every fixed function feature is going to be different, removing the Xbox 360's tessellator wouldn't have allowed for another ALU or reduced the console price.
ATI already had the tessellator designed and integrated into their geometry processing unit. It was included in all of their DX9 chips. So the research and integration costs were already paid. The tessellator must have been a really small part of the chip, since ATI chose to include it in all their DX9 cards, even though it was pretty much never used (in PC games). Also, ATI never boosted up other parts of their chips (triangle setup, etc.) to cope with the increased triangle counts produced by the tessellator. It was just a cheap addition, not a huge feature that affected the whole GPU design.

However, if you look at Fermi's architecture, it's evident that efficient tessellation required huge research costs and changes to the chip layout. Nvidia split their raster engines (incl. edge setup) into four separate parallel units, and split their geometry processing units ("PolyMorph Engine") into 16 parallel units. All of these fixed function units were previously located in a single unit. The new system required new communication paths between the geometry processing units (parallel execution hazards need to be solved), and lots of other changes. All these big architecture changes were implemented for tessellation. Tessellation still isn't used by more than a few games and benchmarks, and most Fermi cards will no longer be in use by the time tessellation finally gets popular. So spending a huge amount of transistors on tessellation wasn't really a good call for Fermi. Like I said earlier, fixed function hardware is a huge gamble. A link discussing all the changes in Fermi's geometry processing: http://www.anandtech.com/show/2918/2
 
Removing the tessellator wouldn't necessarily have allowed them to add more ALUs even if it gave them the area to do it, if you know what I mean. The chip already drew a lot of power.
 
The tessellator unit in both R600 and R700 architectures took no more than 2M transistors - a drop in the ocean of the whole budget. Xenos' implementation probably took even less than that, for a first generation product.
 
ATI already had the tessellator designed and integrated into their geometry processing unit. It was included in all of their DX9 chips.

If you're referring to TruForm, I don't think that's true. The 8500 certainly had it, but AFAIK it was done in the driver (aka software) on later cards and then abandoned in later drivers.
 
However, if you look at Fermi's architecture, it's evident that efficient tessellation required huge research costs and changes to the chip layout. Nvidia split their raster engines (incl. edge setup) into four separate parallel units, and split their geometry processing units ("PolyMorph Engine") into 16 parallel units. All of these fixed function units were previously located in a single unit. The new system required new communication paths between the geometry processing units (parallel execution hazards need to be solved), and lots of other changes. All these big architecture changes were implemented for tessellation. Tessellation still isn't used by more than a few games and benchmarks, and most Fermi cards will no longer be in use by the time tessellation finally gets popular. So spending a huge amount of transistors on tessellation wasn't really a good call for Fermi. Like I said earlier, fixed function hardware is a huge gamble. A link discussing all the changes in Fermi's geometry processing: http://www.anandtech.com/show/2918/2
Tessellation can easily be made efficient with a single rasterizer, and I agree Fermi did a good job with their multiple rasterizers. However, I bet they are happy with the area addition for it. It doesn't help much with games, but it helps tremendously in the workstation market. At AMD, tessellation wasn't the reason we went with 2x prim rate for Cayman.
 
The thing that annoys me most in fixed function hardware is that it limits my freedom. Fixed function hardware (and the ratio of different FF hardware) is designed for the needs of an average game (usually a last generation game, since hardware development takes time), and if your game is anything different, the hardware doesn't suit your needs that well. Games are different, and have different hardware requirements. Programmable hardware allows game developers to implement the game they want, not something that the hardware manufacturer has planned for them. Creative freedom is important for good quality games. We need more different games, not hardware limitations that force us to do things in a certain way.

I have been a graphics programmer since the 3dfx era (Voodoo, Riva 128, TNT, etc. are familiar to me). Ever since fixed function T&L came along, graphics hardware has had major limitations compared to software graphics engines. You just had to cope with the limitations and invent creative hacks to get the required graphics features implemented. The time period when half of the cards supported EMBM (environment mapped bump mapping) and half of the cards supported DOT3 bump mapping is something I really want to forget. Luckily DX8 was released, and we finally got shaders that allowed us to implement either one, and much, much more. I have seen the golden age of fixed function graphics programming, and I don't want to go back there. Most of what graphics programmers were doing back then was hacking their way around fixed function hardware limitations.
This era isn't the same as that era. Back then you could only pack ff hw (which has amazing area efficiency) for bits and pieces of the pipeline. Today you can do virtually the entire pipeline in sw with reasonable perf. With more programmability it becomes easier all the time.

Let's say you take out 1 CU from Tahiti to put some ff hw for X in there. Your compute goes down by 3%. I am not advocating putting this in at the expense of any flexibility one would desire as a programmer. What you can build today in 25M (50M?) transistors is nothing to sneeze at, and this marginal budget will increase all the time. Think about it: which algorithm will fall off the usability cliff with a 3% reduction in compute? No technique is affected adversely in any meaningful way. But there is one technique out there that is sped up 10x.
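
To put rough numbers on that trade (my own back-of-the-envelope model, not the poster's): Tahiti has 32 CUs, so dropping one costs about 3% of compute, and the fixed function block pays for itself as soon as the technique it accelerates 10x takes more than roughly 3-4% of the frame. A trivial C++ sketch:

Code:
#include <cstdio>

// Back-of-the-envelope model: losing 1 of Tahiti's 32 CUs slows compute-bound
// work by 32/31 (~3.2%); the fixed function block runs its technique 10x faster.
// Frame time is normalized so the all-programmable chip = 1.0.
int main()
{
    const double slowdown = 32.0 / 31.0;
    const double speedup  = 10.0;
    for (int i = 0; i <= 10; ++i) {
        double f = i * 0.01;                             // frame fraction spent on the technique
        double withFF = (1.0 - f) * slowdown + f / speedup;
        std::printf("f = %4.2f  relative frame time = %.3f\n", f, withFF);
    }
    return 0;                                            // break-even lands near f ~= 0.035
}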

Now you could argue that the technique being implemented in ff should be well thought out. I totally agree. Tessellation took many goes to get it right (GS <-> xbox tessellator -> dx11). But there is more compute out there all the time, and less and less of it will be usable, thanks to dark silicon. Even if your chip were carpeted wall to wall with compute, you wouldn't be able to use all of it all the time.

The way tessellation was implemented in the DX11 API isn't that bad really. Domain shaders and hull shaders are written in the same language as all the other shaders and utilize the same programmable hardware. Only the part of the system that spawns new GPU work items is fixed function. We don't yet have a programmable way to spawn work items on the GPU, so this is understandable. But (in the future) when we get a proper way to do that, there might no longer be a need for separate hull and domain shaders. Of course, that is also what I want from the future.

Fixed function hardware bottlenecks are a real thing. Thinking back to our last console project, we were bottlenecked by triangle setup, texture sampling, vertex fetch, ALU, fill rate (rops), texture caches and BW :). All in different stages of our scene rendering. Shadow map rendering for example was rop bound for near objects and triangle setup or vertex fetch bound for far away objects (depending on object type). Object rendering was texture sampling bound for some objects, texture cache bound for some (reflective), vertex fetch bound for some (high polygon, far away) and ALU bound for some (very near objects with heavy shaders). Deferred lighting was ALU bound, and post processing was texture sampling bound. Most of the time during the frame we were bottlenecked by some fixed function unit. The ALUs sat there doing nothing during that time, and we were still heavily ALU bound during lighting. That's what happens when you have too many different bottlenecks (different fixed function units). We of course did some creative hacks, for example to reduce vertex fetching bottlenecks (compress vertices and burn some ALU to decompress) and to reduce texture fetching / BW bottlenecks (encoding/decoding), but optimizations like this were only a win in some cases, so the number of shader combinations increased, and debugging and profiling time increased drastically. This is what you have to do in order to cope with the GPU fixed function bottlenecks. It takes lots of programming resources = makes games cost more to develop.
Consoles are so primitive that one has to dig deep to get anything useful done there. That kind of development is more the exception than the norm. While it is true that your game was bound in different stages at different times, this whack-a-mole bottleneck situation is tempting to remove by going the mostly-sw route. But any reasonable analysis of bottlenecks in a mostly-sw system vs whack-a-mole bottlenecks would reveal that a mostly-sw route would have far worse bottlenecks in one place.
 
At AMD, tessellation wasn't the reason we went with 2x prim rate for Cayman.
And I agree that was a very good compromise for this generation of hardware. It removes bottlenecks in super high polygon mesh rendering and improves tessellation performance, but didn't require a huge overhaul of the architecture (or a major commitment to rendering techniques that are not yet widely used).
Removing the tessellator wouldn't necessarily have allowed them to add more ALUs even if it gave them the area to do it, if you know what I mean. The chip already drew a lot of power.
Current generation consoles didn't have any hardware for dynamic clock adjustments (turbo or throttling). The thermals had to be designed to handle full load on all units at the same time, so the power consumed by all fixed function units at full load surely had to be taken into account in the TDP design. Removing the tessellator, for example, might not have affected much, but removal of the 4xMSAA hardware (including the EDRAM part of it) would surely have had a big impact (on many things). But who would have thought that the popularity of MSAA would decrease so rapidly four/five years after the console launch? So I cannot criticize this decision. But in general it shows that even big chunks of fixed function hardware can be left unused when rendering techniques evolve. Fixed function is not as future proof as programmable hardware.

Of course this is different for recent hardware that is designed to scale clocks up/down based on demand. All recent CPUs have turbo clocks, and GPUs are also getting similar features (Intel's GPUs have 3x clock scaling already). Mobile SoCs also have good clock scaling built in. This of course makes fixed function hardware a slightly smaller risk, because if it is unused, at least you can clock up the rest of the chip.
Consoles are so primitive that one has to dig deep to get anything useful done there. That kind of development is more the exception than the norm.
Console development isn't an "exception to the norm" for game programmers. Practically all AAA games are released on consoles, and most PC games are ports of console games. So game rendering pipelines need to be designed around console fixed function hardware bottlenecks. Things are moved around to minimize bottlenecks at every rendering stage (there are, for example, dozens of different deferred lighting/shading techniques just to combat different bottlenecks). These decisions also affect PC ports, as rewriting the whole rendering pipeline isn't something most developers are willing to do lightly.
While it is true that your game was bound in different stages at different times, this whack-a-mole bottleneck situation is tempting to remove by going the mostly-sw route. But any reasonable analysis of bottlenecks in a mostly-sw system vs whack-a-mole bottlenecks would reveal that a mostly-sw route would have far worse bottlenecks in one place.
But doesn't it feel even a little bit wrong, also for a PC developer, to have a whopping 2048 shader units (Tahiti) idling while you render your shadow maps (bottlenecked by fixed function triangle setup or rops)? These 2048 shader units would surely be able to crunch a lot of triangle setup math. Or to have 16 turbocharged geometry engines (Kepler) idling while you do deferred lighting and post processing?

I am not suggesting going the fully sw route. Fixed function is fine for small repeated tasks. But I do not want to have more large scale fixed function units either. The goal must be a more programmable/flexible pipeline in the future (not less).
 
At the very least, texture units and ROPs will remain dedicated HW for the next few generations. They are logic structures with very well determined functions/benefits, giving an unbeatable performance/power/size ratio that no programmable logic array could even come close to. On the other hand, there's enough ALU throughput to complement and enhance those dedicated units beyond their capabilities. A few examples: higher order magnification filtering for shadow maps, advanced post-AA on top of the hardware MSAA (TXAA), elliptical texture filtering using the hardware AF as an input, etc.
 