#26 | |
|
Member
Join Date: Jun 2012
Posts: 298
|
Quote:
|
|
|
|
|
|
|
#27 | |
|
Junior Member
Join Date: Nov 2011
Posts: 16
|
Quote:
I was more referring to the discussion going towards such being used primarily as opposed to our classic vertex/pixel/what have you programs and polygons. |
|
|
|
|
|
|
#28 | ||
|
Member
Join Date: Nov 2007
Posts: 990
|
Quote:
If a huge number of transistors is spent on a rendering technique that is not yet proven to be usable, there is always the risk of not giving developers exactly what they want. There will be limitations that nobody thought about, because the technique wasn't yet widely used and thoroughly researched. A fixed function hardware solution is also likely to result in less research into alternative techniques, and that's not a good thing if one of the alternatives proves to be better in the long run.

Basically you should only use fixed function hardware for established tasks that are thoroughly researched, where the "best" solution to the problem exists and is agreed on by most developers, and where the task is used in the vast majority of games. Good examples: texture samplers, triangle setup, CPU caches, math (sqrt, exp, estimates...), texture decompression, depth buffering (including hierarchical depth and depth compression), general purpose compression/decompression (LZ-style), video decompression (for the parts that benefit most from FF hardware), audio mixing...

---

I purposefully left MSAA out of the above list. Let me explain it a bit. Efficient 4xMSAA requires:
1. 4x fixed function point-inside-triangle hardware
2. 4x fixed function depth testing hardware
3. 4x fixed function stencil testing hardware
4. 4x fixed function blending hardware
5. 4x wider data path from rops to memory (but not any wider from PS to rops)
6. More memory and more memory bandwidth (for backbuffer operations) and/or MSAA color compression and depth compression (again more fixed function hardware)

That's a lot of hardware. And MSAA is used by only a small minority of current generation console games (and not in all PC games either). Even in the games that use MSAA, the MSAA hardware is busy for only around 20% of the time (rendering objects to g-buffers). It does nothing for the remaining 80% (it's not needed in shadow map rendering, lighting and post processing).
Current PC GPUs solve the MSAA backbuffer bandwidth demands with depth and color compression hardware (more fixed function hardware), and the rest is solved by making rops (much) fatter. Xbox 360 solved the problem with EDRAM. It had the required bandwidth and built-in fixed function processors for blending/z-buffering/stenciling. EDRAM solved all six requirements.

Now let me talk about shadow map rendering. It's an operation that is heavily fill rate bound. In games that have lots of dynamic shadow casting lights, shadow map rendering can take up to 30% of the frame time. Shadow map rendering mainly requires fixed function depth buffering hardware (and storing depth samples to the render target). These are its main bottlenecks. GPUs with 4xMSAA have four times more fixed function depth buffering hardware (and bus/bw/compression to store the depth values). That must be a very good thing for shadow mapping? Will it provide 4x depth fill rate? Unfortunately no...

You can render the shadow map to a 4xMSAA buffer, get proper separate depth testing for each subpixel and have 4x fill rate. However, the fixed function point-inside-triangle testers have fixed offsets (an optimized MSAA pattern), and these offsets do not form a regular grid. So sampling from the rendered shadow map texture is not straightforward. Even the simplest nearest neighbor sampling requires an excessive amount of math (dozens of instructions). Because one of the six fixed function components (1) is not up to the task, all the others (2-6) are pretty much unusable for shadow mapping as well. What a shame.

The lesson here is: even if some rendering technique is currently very popular (MSAA antialiasing), it might suddenly become a niche because of new research (deferred shading).
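To make the sampling problem above concrete, here is a rough Python sketch (not shader code) comparing nearest-sample lookup for a regular 2x2 subsample grid against a rotated 4x pattern. The offsets in `ROTATED_4X` are assumed to follow the commonly documented Direct3D standard 4x pattern, in 1/16 pixel units; real hardware patterns may differ, and all names here are made up for illustration.

```python
# Sample offsets in 1/16ths of a pixel, relative to the pixel centre.
# ROTATED_4X is an assumption (commonly documented D3D standard 4x pattern).
ROTATED_4X = [(-2, -6), (6, -2), (-6, 2), (2, 6)]
# A regular 2x2 grid: the layout a shadow map lookup would prefer.
REGULAR_4X = [(-4, -4), (4, -4), (-4, 4), (4, 4)]


def nearest_regular(u, v):
    """Regular grid: the sample index falls out of two comparisons.

    (u, v) is the lookup position inside the pixel, each in [0, 1).
    """
    return (1 if u >= 0.5 else 0) + (2 if v >= 0.5 else 0)


def nearest_in_pattern(u, v, pattern):
    """Arbitrary pattern: brute-force distance test over the 3x3 pixel block.

    Returns (pixel_dx, pixel_dy, sample_index) of the closest sample.
    The closest sample may belong to a neighbouring pixel.
    """
    best = None
    for py in (-1, 0, 1):
        for px in (-1, 0, 1):
            for index, (ox, oy) in enumerate(pattern):
                sx = px + 0.5 + ox / 16.0
                sy = py + 0.5 + oy / 16.0
                d2 = (sx - u) ** 2 + (sy - v) ** 2
                if best is None or d2 < best[0]:
                    best = (d2, px, py, index)
    return best[1:]


if __name__ == "__main__":
    print(nearest_regular(0.1, 0.1))
    print(nearest_in_pattern(0.0, 0.4, ROTATED_4X))
```

With the regular grid the index is two comparisons. With the rotated pattern the closest sample can sit in a neighbouring pixel, so this sketch falls back to a search. A real shader would do something smarter than brute force, but it would still pay for the irregular layout.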
MSAA depth testing hardware would be very good for shadow map rendering (4x fill rate), but because of a single small mistake in the FF hardware design it cannot be used for shadow map rendering at all (two sets of MSAA sample offsets and a flag to select the preferred one would be enough to solve the issue).

---> Fixed function hardware is very fragile. Even tiny limitations might make it unusable.

All this extra fixed function hardware for 4xMSAA makes rops very fat. Current generation consoles have only eight rops, and thus are often fill rate bound. I would trade all that MSAA hardware for 16 rops any day (and even that would halve the fixed function hardware listed). EDRAM has the bandwidth, so that's not a problem.
Quote:
Processors with VLIW such as Intel Itanium and Transmeta Crusoe are dead. Itanium was always plagued with heat problems. GPUs with VLIW are going the same route: NVidia dropped both VLIW and separate pixel and vertex shaders in the Geforce 8000 series, and we all know how that story ended (the 8800 is one of the most popular pieces of graphics hardware ever). AMD kept VLIW alive for longer, but dropped it in GCN. So VLIW is as good as dead (in both CPU and GPU markets).

Utilization is important. The Pentium 4 pipeline was long to allow very high clocks. It's very hard to keep a long pipeline properly utilized. Intel's Core 2 architecture came up with a much shorter pipeline that allowed much higher hardware utilization. P4 had much higher theoretical peak numbers, but the improved utilization was the key to both good performance and good efficiency. Thus Core 2 was the biggest jump in perf/watt we have ever seen in CPUs.

---

Also you have to think about chip manufacturing costs. Each fixed function unit requires some extra transistors. And these transistors have a very low utilization ratio (they idle most of the time, even in software that uses them). These units also require separate design work (research cost), while you can often just replicate the programmable units many times to scale up the performance, and slightly improve the design from one generation to the next.

Replication is also a boon for programmable units because of yields. You can just add a few extra programmable units to improve yields (and thus lower production costs). Any programmable unit can replace any other programmable unit if there is a manufacturing defect. For fixed function units, you need a separate spare for each fixed function unit type (a broken video decoder cannot be replaced by anything else, so you might need two of them on the chip just to improve yields). Additional broken programmable units often only degrade performance, while the feature set stays the same. This provides a cost-effective way to sell hardware at lower price points. |
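The yield argument above can be put in rough numbers. The sketch below is a toy model only: it assumes every unit fails independently with the same probability, and the unit counts and the 2% defect rate are invented for illustration, not taken from any real chip.

```python
from math import comb


def yield_with_spares(units_built, units_needed, p_defect):
    """Probability that at least `units_needed` of `units_built`
    interchangeable units work, with independent defects."""
    p_ok = 1.0 - p_defect
    return sum(
        comb(units_built, k) * p_ok ** k * p_defect ** (units_built - k)
        for k in range(units_needed, units_built + 1)
    )


# Hypothetical chip that needs 32 working programmable units.
no_spares = yield_with_spares(32, 32, 0.02)
two_spares = yield_with_spares(34, 32, 0.02)

# A fixed function block can only be backed up by a copy of itself.
one_decoder = yield_with_spares(1, 1, 0.02)
two_decoders = yield_with_spares(2, 1, 0.02)

if __name__ == "__main__":
    print(f"32 of 32 programmable units: {no_spares:.3f}")
    print(f"32 of 34 programmable units: {two_spares:.3f}")
    print(f"1 of 1 video decoders:       {one_decoder:.3f}")
    print(f"1 of 2 video decoders:       {two_decoders:.3f}")
```

In this model two spare programmable units cover a defect in any of the 34 units, while a spare decoder only covers the decoder, so each distinct fixed function block needs its own duplicate.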
||
|
|
|
|
|
#29 |
|
Member
Join Date: Nov 2007
Posts: 990
|
The thing that annoys me most about fixed function hardware is that it limits my freedom. Fixed function hardware (and the ratio of different FF hardware) is designed for the needs of an average game (usually a last generation game, since hardware development takes time), and if your game is anything different, the hardware doesn't suit your needs that well. Games are different, and have different hardware requirements. Programmable hardware allows game developers to implement the game they want, not something that the hardware manufacturer has planned for them. Creative freedom is important for good quality games. We need more different games, not hardware limitations that force us to do things in a certain way.
I have been a graphics programmer since the 3dfx era (Voodoo, Riva 128, TNT, etc are familiar to me). Ever since fixed function T&L came, graphics hardware has had major limitations compared to software graphics engines. You just had to cope with the limitations and invent creative hacks to get the required graphics features implemented. The time period when half of the cards supported EMBM (environment mapped bump mapping) and half of the cards supported DOT3 bump mapping is something I really want to forget. Luckily DX8 was released, and we finally got shaders that allowed us to implement either one, and much much more. I have seen the golden age of fixed function graphics programming, and I don't want to go back there. Most of what graphics programmers were doing back then was hacking their way around fixed function hardware limitations.

The way tessellation was implemented in the DX11 API isn't that bad really. Domain shaders and hull shaders are written in the same language as all the other shaders and utilize the same programmable hardware. Only the part of the system that spawns new GPU work items is fixed function. We don't yet have a programmable way to spawn work items on the GPU, so this is understandable. But (in the future) when we get a proper way to do that, there might no longer be a need for separate hull and domain shaders. Of course that is what I also want from the future.

Fixed function hardware bottlenecks are a real thing. When thinking about our last console project, I think that we were bottlenecked by triangle setup, texture sampling, vertex fetch, ALU, fill rate (rops), texture caches and BW |
|
|
|
|
|
#30 |
|
Member
Join Date: Sep 2003
Location: UK, Bedfordshire
Posts: 450
|
Thanks for the last two posts sebbbi. They were an absolute joy to read
__________________
PeterAce "Lost in quantisation" |
|
|
|
|
|
#31 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,036
|
Quote:
Your post is correct in many areas, but I wanted to note that if a fixed function feature requires little enough hardware, its production cost can effectively be hidden. Sections of the chip are synthesized and laid out separately, and depending on how easy a section is to route and time, utilization can be increased without negatively affecting area. My main point is that not every extra gate in a chip is felt, but changes that are large enough to contribute percentage points to a section will start to affect area. |
|
|
|
|
|
|
#32 | |
|
Member
Join Date: Nov 2007
Posts: 990
|
Quote:
However if you look at Fermi's architecture, it's evident that efficient tessellation required huge research costs and changes to chip layout. Nvidia split their rops (incl. edge setup) into four separate parallel units, and split their geometry processing units ("Polymorph Engine") into 16 parallel units. All of these fixed function units were previously located in a single unit. The new system required new communication paths between the geometry processing units (parallel execution hazards need to be solved), and lots of other changes. All these big architecture changes were implemented for tessellation. Tessellation still hasn't been used by more than a few games and benchmarks. Most Fermi cards will no longer be in use when tessellation finally gets popular. So spending a huge number of transistors on tessellation wasn't really a good call for Fermi. Like I said earlier, fixed function hardware is a huge gamble. A link discussing all the changes in Fermi's geometry processing: http://www.anandtech.com/show/2918/2 |
|
|
|
|
|
|
#33 | |
|
hardly a Senior Member
Join Date: Jul 2008
Location: still camping with a mauler
Posts: 3,676
|
Removing the tessellator wouldn't necessarily allow them to add more ALUs even if it gave them the area to do it if you know what I mean. The chip already drew a lot of power.
__________________
Quote:
|
|
|
|
|
|
|
#34 |
|
Senior Member
|
The tessellator unit in both R600 and R700 architectures took no more than 2M transistors - a drop in the ocean of the whole budget. Xenos' implementation probably took even less than that, for a first generation product.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#35 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,661
|
If you're referring to Tru-Form, I don't think that's true. The 8500 certainly had it, but afaik on later cards it was done in the driver (aka software), and then abandoned in later drivers.
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™ |
|
|
|
|
|
#36 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,036
|
Quote:
|
|
|
|
|
|
|
#37 | ||
|
Senior Member
|
Quote:
Let's say you take out 1 CU from Tahiti to put some ff hw for X in there. Your compute goes down by 3%. I am not advocating putting this at the expense of any flexibility one would desire as a programmer. What you can build today in 25M (50M?) transistors is nothing to sneeze at. And this marginal budget will increase all the time. Think about it, which algorithm is it that will fall off the usability cliff with 3% reduction in compute? No technique is affected adversely in any meaningful way. But there is one technique out there that is sped up 10x. Now you could argue that the technique being implemented in ff should be well thought out. I totally agree. Tessellation took many goes to get it right (GS <-> xbox tessellator -> dx11). But there is more compute out there all the time and less and less of it will be usable, thanks to dark silicon. Even if your chip was wall to wall carpeted with compute, you wouldn't be able to use all of it all the time. Quote:
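A back-of-the-envelope version of the trade described above, as a sketch. It assumes the worst case for the programmable side: everything except technique X is purely ALU bound and slows down in proportion to the lost CU (1 of 32 on Tahiti, roughly 3%). The function names and the 16 ms frame are made up for illustration.

```python
def frame_time_after_swap(frame_ms, x_fraction, total_cus=32, x_speedup=10.0):
    """Frame time after trading one CU for fixed function hardware for X.

    frame_ms   : original frame time in milliseconds
    x_fraction : share of the original frame spent on technique X (0..1)
    """
    x_ms = frame_ms * x_fraction
    other_ms = frame_ms - x_ms
    # Assumption: all other work scales linearly with the CU count.
    slowed_other_ms = other_ms * total_cus / (total_cus - 1)
    return slowed_other_ms + x_ms / x_speedup


def break_even_fraction(total_cus=32, x_speedup=10.0):
    """Share of the frame X must occupy for the swap to cost nothing."""
    return 1.0 / (total_cus - (total_cus - 1) / x_speedup)


if __name__ == "__main__":
    for share in (0.0, 0.05, 0.2):
        print(share, round(frame_time_after_swap(16.0, share), 3))
    print("break-even share:", round(break_even_fraction(), 4))
```

Under these assumptions the swap pays off once X takes more than a few percent of the frame. The model ignores the utilization and design-risk points raised elsewhere in the thread, so treat it as arithmetic, not a verdict.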
|
||
|
|
|
|
|
#38 |
|
Tea maker
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,396
|
It's right?
__________________
"Your work is both good and original. Unfortunately the part that is good is not original and the part that is original is not good." -(attributed to) Samuel Johnson "I invented the term Object-Oriented, and I can tell you I did not have C++ in mind." Alan Kay |
|
|
|
|
|
#39 | ||||
|
Member
Join Date: Nov 2007
Posts: 990
|
Quote:
Quote:
Of course this is different for recent hardware that is designed to scale clocks up/down based on demand. All recent CPUs have turbo clocks, and GPUs are also getting similar features (Intel's GPUs have 3x clock scaling already). Mobile SoCs also have good clock scaling built in. This of course makes fixed function hardware a slightly smaller risk, because if it is unused, at least you can clock up the rest of the chip. Quote:
Quote:
I am not suggesting going the full software route. Fixed function is fine for small repeated tasks. But I do not want to have more large scale fixed function units either. The goal must be a more programmable/flexible pipeline in the future (not less).
Last edited by sebbbi; 26-Jul-2012 at 16:53. |
||||
|
|
|
|
|
#40 |
|
Senior Member
|
At least texture units and ROPs will remain dedicated HW for the next few generations. They represent logic structures with very well determined functions/benefits, giving an unbeatable performance, power and size ratio that no programmable logic array could even come close to. On the other hand, there's enough ALU throughput to complement and enhance those dedicated units beyond their capabilities. A few examples: higher order mag filtering for shadow maps, advanced post-AA on top of the hardware MS (TXAA), Elliptical Texture Filtering using the hardware AF as an input, etc.
|
|
|
|
|
#41 |
|
yes, i'm drunk
|
nVidia supports max of 4, not 6.
__________________
I'm nothing but a shattered soul... Been ravaged by the chaotic beauty... Ruined by the unreal temptations... I was betrayed by my own beliefs... |
|
|
|
|
|
#42 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,661
|
That's why I said "up to".
|
|
|
|
|
#43 |
|
Senior Member
|
Well, it's used at least. That's much more than can be said for GS (as a geometry amplifier) or xbox360.
|
|
|
|
|
|
#44 | ||
|
Senior Member
|
Quote:
Besides, idle ALUs are better than saturated power hungry ALUs that turn out to be slower. Quote:
|
||
|
|
|
|
|
#45 | |
|
Member
Join Date: Nov 2007
Posts: 990
|
Quote:
The ability to run multiple kernels in parallel was first introduced in Fermi GF-100 (http://www.anandtech.com/show/2849/5). It naturally requires that the kernels do not share resources (for example the backbuffer) unless they are read-only, and it can only take draw calls (kernels) from the command list in the order they are submitted (*). This limits the usability a lot.

Kepler's Hyper-Q improved this situation significantly. Kepler can fetch commands from up to 32 GPU command lists. This however doesn't help with graphics rendering, since the current DX11 multithreading model is based on a single GPU command list (other threads just render to software command buffers that are submitted to one GPU command list, one after another). This is one of the things that I hope is improved in DX12, since AMD's heterogeneous system architecture (HSA) slides also talk about running multiple contexts/kernels in parallel and feeding the GPU from multiple command lists (GPU support will be there from both manufacturers). This is also a great way to reduce compute latency (the user can set a higher priority for the command list that is used for submitting compute work).

(*) APIs do guarantee that your draw calls will be executed in the order they were submitted (the result being identical to that order of execution). Since you usually have thousands of draw calls in your shadow map rendering, all the tasks you can find to execute in parallel use the same simple shader and are bottlenecked by the same fixed function units. In future DirectX versions you will likely be able to submit ALU heavy work to a separate command list and run it in parallel with shadow map rendering (the GPU fetching from both command lists, and doing context switching / shader unit allocation as it best determines).

The easiest way to fill all the units all the time would be to process multiple frames simultaneously (added latency). For example, you start a frame by rendering your shadow maps (vertex/rop/triangle heavy) while simultaneously doing deferred lighting and post processing for your previous frame (alu/tex/bw heavy). That would provide very good hardware utilization, but add half a frame of extra latency. Of course it would also be more complex for the programmer to handle (manual load balancing between multiple shaders). And it all depends on how well the GPU can split its resources. Some resources are tied to other resources. For example in Kepler, the geometry units are in the shader multiprocessors. If you need lots of shader multiprocessors for pixel crunching (lighting and post processing), you also occupy the geometry units, and thus they cannot be used simultaneously for another (geometry heavy) kernel. |
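The multi-frame overlap idea can be sketched as a two-stage pipeline. This is an idealized toy model: it assumes the shadow pass and the lighting/post pass contend for no shared resource at all (the post itself points out that this does not hold on Kepler), and the function names and millisecond figures are invented. How much latency the overlap really adds depends on how the two passes balance.

```python
def sequential(shadow_ms, lighting_ms):
    """One frame at a time: shadow maps, then lighting and post."""
    period = shadow_ms + lighting_ms
    return {"frame_period_ms": period, "latency_ms": period}


def overlapped(shadow_ms, lighting_ms):
    """Shadow maps of frame N run alongside lighting/post of frame N-1.

    In steady state a new frame starts every max(shadow, lighting) ms.
    Lighting for frame N has to wait until both its own shadow pass and
    the previous frame's lighting have finished, i.e. one full period.
    """
    period = max(shadow_ms, lighting_ms)
    return {"frame_period_ms": period, "latency_ms": period + lighting_ms}


if __name__ == "__main__":
    print(sequential(5.0, 11.0))
    print(overlapped(5.0, 11.0))
```

The overlap shortens the frame period, so throughput goes up, while each individual frame takes longer from start to finish whenever the two passes are unevenly sized.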
|
|
|
|
|
|
#46 |
|
Regular
|
http://forums.create.msdn.com/forums/t/106060.aspx
There's even a question mark over whether D3D11.1 will run on W7 (and Vista, presumably). What are the chances of those OSs running D3D12? If that doesn't happen maybe we can just forget about D3D12 entirely. How's OpenGL shaping up?... Is it going to catch up? Sorry for another de-rail subject.
__________________
Can it play WoW? |
|
|
|
|
|
#47 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
|
#48 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
|
#49 |
|
Senior Member
Join Date: Mar 2008
Posts: 5,158
|
What about daughter cards for more memory? I'd love to see a board that would connect to future video cards. DDR3 is extremely cheap and pretty fast. Say you connect 16 to 32 GB of DDR3-1866. Each channel should give about 15 GB/s of bandwidth, so dual channel should be about 30 GB/s and quad about 60 GB/s. That would be a nice boost for games, and with 16-32 GB most games would fit inside it.
I don't know if this would have to be included in the dx specs. |
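The bandwidth figures above can be checked with simple arithmetic. This sketch assumes a standard 64-bit DDR3 channel and treats DDR3-1866 as 1866 million transfers per second. It gives peak theoretical numbers only, and the function name is made up for illustration.

```python
def ddr_bandwidth_gb_s(mega_transfers_per_s, channels=1, bus_width_bits=64):
    """Peak theoretical bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    bytes_per_second = mega_transfers_per_s * 1e6 * bytes_per_transfer * channels
    return bytes_per_second / 1e9


if __name__ == "__main__":
    for channels in (1, 2, 4):
        print(channels, "channel(s):", round(ddr_bandwidth_gb_s(1866, channels), 1), "GB/s")
```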
|
|
|
|
|
#50 | |
|
Senior Member
|
AMD: There Won’t be DirectX 12!
Quote:
|
|
|
|