Direct3D feature levels discussion

Fermi and Kepler do not support FL 11_1. There is no such thing as "almost supporting". If you ask the DirectX 11 API for an 11_1 feature level device, it gives you an error on Fermi and Kepler. You must create a FL 11_0 device to make your program work on Fermi and Kepler.
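For illustration, a minimal sketch of the usual fallback pattern (plain D3D11; the helper name is mine): ask for 11_1 first and let the runtime fall back to 11_0, so the same code path works on Fermi and Kepler. Note that pre-11.1 runtimes can reject an array that even contains 11_1 with E_INVALIDARG, hence the retry.

```cpp
#include <d3d11.h>

// Try FL 11_1 first, fall back to FL 11_0 (Fermi/Kepler end up here).
HRESULT CreateDeviceWithFallback(ID3D11Device** device,
                                 ID3D11DeviceContext** context,
                                 D3D_FEATURE_LEVEL* chosenLevel)
{
    const D3D_FEATURE_LEVEL levels[] = {
        D3D_FEATURE_LEVEL_11_1,
        D3D_FEATURE_LEVEL_11_0
    };

    HRESULT hr = D3D11CreateDevice(
        nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
        levels, static_cast<UINT>(sizeof(levels) / sizeof(levels[0])),
        D3D11_SDK_VERSION, device, chosenLevel, context);

    if (hr == E_INVALIDARG) // pre-11.1 runtime rejects the 11_1 entry itself
    {
        hr = D3D11CreateDevice(
            nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
            &levels[1], 1, D3D11_SDK_VERSION,
            device, chosenLevel, context);
    }
    return hr;
}
```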
Yes, but they are increasing UAV support right in D3D12
nVidia was leading the DX9 feature race with the FX 5000 series? Too bad you could not actually use it, because the performance was atrocious :LOL:
SM 2.0a (GeForce 5/FX) was a small improvement over SM 2.0 (ATI 9500-9800), but then ATI superseded it with SM 2.0b (which was not a real improvement over SM 2.0a), then NVIDIA came with SM 3.0 and ATI did not fully implement it (missing Vertex Texture Fetch?)... Yeah, a great addition to the DX9 cap bits madness.
 
AMD's asynchronous compute implementation is also very good, as the fully bindless nature of their GPU means that the CUs can do very fine grained simultaneous execution of multiple shaders. Don't get fooled by the maximum number of compute queues (shown by some review sites). Big numbers don't tell you anything about the performance. Usually running two tasks simultaneously gives the best performance; running significantly more just thrashes the data and instruction caches.
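For context, this is roughly what setting up async compute looks like on the application side in D3D12 (a minimal sketch using the public API; error handling omitted): one graphics queue plus one compute-only queue, matching the "two simultaneous tasks" sweet spot described above.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a graphics queue and a separate async compute queue on an existing device.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // graphics + compute + copy
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&graphicsQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
    // Work submitted to computeQueue can overlap graphics work; ID3D12Fence
    // objects are used to synchronize between the two queues where needed.
}
```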

This is really interesting and would imply there's minimal advantage to GCN1.1's increased number of ACEs and compute queues over GCN1.0 - at least for gaming.

You may not be allowed to answer this one, but can you shed any light on how Maxwell 2.0's async compute implementation compares to GCN in terms of performance / capability? I know you've touched on it above, but I wasn't sure if that was specifically in relation to Maxwell 2 or older NV architectures?
 
As you failed to state in your reply, Nvidia is also currently the only one supporting conservative rasterization and ROVs as features. Many would find this much more relevant than the differences in resource binding (tier 2 vs tier 3) as a feature, but I guess that depends on which "camp" you sit in and who will be providing programming "support" for your game development efforts in the near future.
DX 12.1 = ROV + conservative rasterization (+ volume tiled resources). I didn't mention these features separately as these features define DX 12.1.

These three are very nice features indeed and allow completely new kinds of approaches to some graphics problems. The problem is that these new approaches are very different from the existing ones, and need big changes to engine architecture. Both next gen consoles are GCN based. Rendering pipelines will be designed around GCN feature set for the following years. Games are not yet even exploiting the GCN feature set fully. Features missing compared to GCN (such as tier 3 resource binding) are likely going to be more relevant in practice than extra (future looking) features.

Hopefully some small indie developer releases a stunning voxel based (PC only) game that requires DX 12.1 (all the new DX 12.1 features are perfect for voxel rendering). This kind of killer app would make DX 12.1 much more relevant. Big studios would never release a game targeting less than 5% of the PC market (and skip the consoles completely).

The only DX 12.1 feature that is easy to plug into most existing engines is ROV. It can be used to implement efficient OIT. However, ROV is highly useful for many other purposes as well (especially with CR). ROV is currently supported by Intel Haswell and Broadwell and NVIDIA Maxwell 2. It has the broadest support of all the DX 12.1 mandatory features. I expect to see games supporting this feature, even if the rest of the DX 12.1 feature set proves to be too forward looking for cross platform games.
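As an aside, all of these tiers and caps are reported through a single query, so a renderer can pick its paths at startup. A sketch (assuming an already-created ID3D12Device; the function name is mine):

```cpp
#include <d3d12.h>

// Query the optional-feature tiers discussed above on an existing device.
void ReportFeatureTiers(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    if (SUCCEEDED(device->CheckFeatureSupport(
            D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options))))
    {
        BOOL rovs = options.ROVsSupported;                     // required for FL 12_1
        D3D12_CONSERVATIVE_RASTERIZATION_TIER cr =
            options.ConservativeRasterizationTier;             // required for FL 12_1
        D3D12_TILED_RESOURCES_TIER tiled =
            options.TiledResourcesTier;                        // tier 3 = volume tiled resources
        D3D12_RESOURCE_BINDING_TIER binding =
            options.ResourceBindingTier;                       // tier 3 on GCN
        // ...choose rendering paths based on these values...
        (void)rovs; (void)cr; (void)tiled; (void)binding;
    }
}
```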
 
Both next gen consoles are GCN based. Rendering pipelines will be designed around GCN feature set for the following years. Games are not yet even exploiting the GCN feature set fully. Features missing compared to GCN (such as tier 3 resource binding) are likely going to be more relevant in practice than extra (future looking) features.

If consoles are going to dictate engine capabilities for a decade at a time then we're basically screwed and should give up on PC gaming.

The "future" comes much faster for PC gamers so being held hostage by obsolete console hardware is unfortunate.
 
If consoles are going to dictate engine capabilities for a decade at a time then we're basically screwed and should give up on PC gaming.

The "future" comes much faster for PC gamers so being held hostage by obsolete console hardware is unfortunate.
Next gen consoles have helped PC games a lot. 64 bit executables are becoming the norm (consoles only run 64 bit code). Finally games can use more than 4GB of memory on PC. Compute shaders are finally used by all games and for many purposes. Tessellation is also becoming a standard feature (not just a random high end additional feature that only works with some assets, such as brick walls). Games are starting to use the new DX11 compressed texture formats (higher texture quality). Finally, the DX11 feature set of PC GPUs is fully used. Next gen consoles also ensure that the DX 12.0 feature set will be fully used as soon as DX 12 is available.
 
Next gen consoles have helped PC games a lot.

I don't think we should heap praise on a platform for merely catching up after holding back PC gaming for many years.

But as you said, the market determines where big publishers spend money. And unfortunately that's not at the high end of the PC market.

We can only hope that enterprising developers and PC IHVs continue to push the envelope despite the console albatross around our collective necks.
 
Hopefully some small indie developer releases a stunning voxel based (PC only) game that requires DX 12.1 (all the new DX 12.1 features are perfect for voxel rendering). This kind of killer app would make DX 12.1 much more relevant.

You should do it as a side project sebbbi! Get NV to give you some funding - I'm sure they would for a showcase game that's exclusive to their latest GPU line. It wouldn't have to be some crazy-budget open world game or FPS; a platformer or puzzle game would do fine, I'm sure, as long as it blows everything else away graphically. You know just about everyone with a Maxwell 2 GPU would buy it.
 
You should do it as a side project sebbbi! Get NV to give you some funding - I'm sure they would for a showcase game that's exclusive to their latest GPU line. It wouldn't have to be some crazy-budget open world game or FPS; a platformer or puzzle game would do fine, I'm sure, as long as it blows everything else away graphically. You know just about everyone with a Maxwell 2 GPU would buy it.
There is a nice niche open. So many high end PC gamers are crying for something... to push the visual envelope of their games, not just higher resolution, like Crysis did (to good PC sales) back in 2007. And given that Unreal Engine 4 now has AAA quality procedurally generated forests and landscapes, plus that % royalty deal for indie developers, sebbbi should kickstart or make something for high end gamers. The market is there. The "superior pc mustard race" has been praying since Crysis 2's consolization for something... anything... not just 3DMark benchmarks and NVIDIA tech demos to push our +$300 video cards down to 30fps @ 1080p.
 
Wide SIMD gets a bigger benefit from a scalar unit. The scalar unit can be used to offload the register pressure of wave invariant data and the ALU work of wave invariant math. Wave invariant data stored in a scalar register uses 64x less register storage space. This more than compensates for the slightly increased register pressure of the wide SIMD.
This is not a benefit that is directly related to wide SIMD. For instance, Intel can issue instructions of different SIMD widths (including scalar) that index the register file in fairly arbitrary ways; there's never a need to store scalars as N-wide "vector registers".

Good automatic scalarization is not an impossible problem to solve. There are several good papers on this topic. Of course, if the shader language helped with this (for example OpenCL 2.0 subgroup operations), it would be much easier for the compiler to do a perfect job.
Man do not get me started on how incredibly messed up the whole shading language/OCL 2.0 situation is ;) "Automatic scalarization"? Please! Or we could just admit that the execution model of compute has some basic flaws and fix them ;)

Keeping resource descriptors in vector registers would be a big hit if the resource descriptor is 256 bits (it would take 8 registers per lane). Even if you can squeeze a resource index into 20 bits, it still needs a vector register per lane (assuming you have no other place to store it).
I don't see why you would ever do it that way... the notion that there are "vector vs. scalar registers and instructions" isn't a requirement, that's just how GCN works. I don't think what you're saying is even true on NVIDIA, and it definitely isn't true on Intel.

But unfortunately, using an index to memory would mean that the shader cannot create a descriptor out of thin air (programmatically). Thus sampling always needs at least one indirection. This is not a problem with current games, but there is a clear trend towards larger object counts with a higher variety of textures, meaning that resource descriptor loads will miss the cache more often in the future.
Programmatic creation of descriptors is indeed an advantage of GCN, but I haven't been able to come up with too many cases where it is super-useful. In most of the cases where it's convenient there are few enough permutations that you can effectively just encode the relevant bits and index a few pre-created descriptors in the heap, especially with separate samplers/textures. Again that's not to say that there's anything wrong with that flexibility - and I absolutely would have fundamentally exposed this in Mantle if it was me designing it :) - but I haven't come up with any killer use cases yet. In any case the static samplers stuff in DX12 should help GCN a bit and since it's something that I specifically got folks onboard for, I'm still waiting for my thank-you from AMD ;)

Regarding caches it doesn't really matter ultimately whether the descriptor cache is on the texture sampler, the execution units or both. "More variety" and more cache misses favors neither architecture here really.

Anyways, I actually personally believe that GPUs are going to have to start to pay back some of the mortgages they have taken out on parallelism in the near future (I mean come on, it's pathetic that modern dGPUs need 4k to show real scaling), so IMHO it's not impossible that AMD needs to revisit whether SIMD64 really is the best trade-off. As I said, either design point is fine; I'm just unconvinced that one of the two is clearly better.

... but I guess that depends on which "camp" you sit in and who will be providing programming "support" for your game development efforts in the near future.
That sort of accusation towards a dev is completely uncalled for, and isn't acceptable here.
 
That sort of accusation towards a dev is completely uncalled for, and isn't acceptable here.

I think devs will get a head start on using the most advanced DX12 features that their "sponsors" cards can support, whether that involves techniques to utilize resource binding (tier 2 or tier 3), CR, ROVs, etc. I'm fairly certain Nvidia development groups are less interested in the benefits of resource binding tier 3 and more interested in the enhancements brought about by Maxwell 2's support of conservative rasterization, ROVs, etc.
ARK: Survival Evolved is one title that will be using DX12 code techniques this summer with its PC debut, with console support coming in 2016. No doubt initially it will be a very small percentage of gaming PCs that will see any benefit from this, but developer efforts like this will help educate consumers on checklist features to look for in their next graphics card purchase.
 
Pixel said:
There is a nice niche open. Given so many of the high end PC gamers are crying for something...
The thing is there is no "many high end PC gamers". Plus PC is niche in AAA gaming.
Crysis1 sold over a million copies on PC. The very first sentence, which you cut out, said it was a niche; let me add that back in for you. And by "so many" I meant so many "OF" the high end PC gamers.
I'm not talking about making an entire game off the sales from high end gamers; reread my post.
 
Crysis1 sold over a million copies on PC. The very first sentence, which you cut out, said it was a niche; let me add that back in for you. And by "so many" I meant so many "OF" the high end PC gamers.
I'm not talking about making an entire game off the sales from high end gamers; reread my post.
Crytek did not recoup Crysis 1's costs...
1 million is not enough for an AAA graphics powerhouse.

Most of AMD's and Intel's iGPs are also paired with a dedicated GPU, so the number of "good" PC systems is higher.
Most DX11 PCs have a worse video card than the consoles.
 
I don't think we should heap praise on a platform for merely catching up after holding back PC gaming for many years.
Consoles have been leading the PC in GPU feature set for almost 1.5 years now, and are still leading today. PC GPUs have had some advanced features, but there hasn't been any API support for those features. DirectX 12 will allow the PC to catch up, but it is still in technical preview state today.

According to the latest Steam Survey, the share of 12.1 capable GPUs is almost non-existent right now. Hopefully 12.1 support will reach 1%-2% by Christmas. Still, it would be a little bit too early to say that consoles are holding the PC back, especially since the consoles have some unique GPU features that are not even exposed in DirectX 12.1. Hopefully DirectX 12.2 will expose some of them.
You should do it as a side project sebbbi! Get NV to give you some funding - I'm sure they would for a showcase game that's exclusive to their latest GPU line. It wouldn't have to be some crazy-budget open world game or FPS; a platformer or puzzle game would do fine, I'm sure, as long as it blows everything else away graphically. You know just about everyone with a Maxwell 2 GPU would buy it.
Volume tiled resources are great for distance fields (because the texture filtering works properly for sparse distance fields). I have a little nitpick, however: I dislike the cube shape for DXT compressed textures. BC4 is the best format for distance fields, and it has 256x128x32 texel tiles (64x32 DXT blocks with 32 depth slices). I would have preferred 32x32 DXT blocks with 64 depth slices = 128x128x64 texels. That would have resulted in better locality AND lower memory consumption for sparse distance fields (since only the tiles that include surfaces need to be stored at the highest mip).
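For what it's worth, creating such a sparse distance field volume on PC would go through a reserved (tiled) resource, roughly like this (a sketch assuming Tiled Resources Tier 3 support; the sizes are made up, and the tile mappings would be bound later via ID3D12CommandQueue::UpdateTileMappings):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sparse BC4 volume texture for a distance field: only tiles that contain
// surfaces ever get physical memory mapped to them.
ComPtr<ID3D12Resource> CreateSparseDistanceField(ID3D12Device* device)
{
    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_TEXTURE3D;
    desc.Width            = 2048;                 // example volume dimensions
    desc.Height           = 2048;
    desc.DepthOrArraySize = 1024;
    desc.MipLevels        = 0;                    // full mip chain
    desc.Format           = DXGI_FORMAT_BC4_UNORM;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_64KB_UNDEFINED_SWIZZLE; // required for reserved resources

    ComPtr<ID3D12Resource> texture;
    device->CreateReservedResource(&desc, D3D12_RESOURCE_STATE_COMMON,
                                   nullptr, IID_PPV_ARGS(&texture));
    return texture;
}
```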
This is not a benefit that is directly related to wide SIMD. For instance, Intel can issue instructions of different SIMD widths (including scalar) that index the register file in fairly arbitrary ways; there's never a need to store scalars as N-wide "vector registers".
Oh yes, you can swizzle and broadcast (splat) values to/from a single SIMD lane to reduce the register pressure of wave invariant data. I don't know the low level details of the Intel (GPU) and NVIDIA instruction sets well enough to determine how fast and efficient the swizzles / register indexing are. It seems that AMD has been quite successful in separating the scalar processing from the vector processing, and this has also reduced the need for arbitrary swizzles / register indexing by the vector units. AMD has been able to split the 64 wide SIMD into four independent 16 wide units, and split the 256 KB register file into four 64 KB register files (one per 16 lane SIMD). This separation should allow the smaller register files to be placed much closer to the execution units, reducing latency and power consumption.

The most common swizzle in GPU pixel shader code is the quad swizzle (it swizzles four consecutive threads and is used to calculate the derivatives for mip mapping and anisotropic filtering in pixel shaders). This swizzle never crosses a 16 lane boundary. Slide 42 in this presentation (http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah) describes a full crossbar for this particular operation. There are also three different 32 wide swizzles (slide 43). The document doesn't describe the latency of these swizzles (crossing the 16 lane boundary). Swizzles like this are handy for some compute shader constructs such as vote, broadcast and prefix sum. However, these constructs are not time critical for most shaders (even for those shaders that need them).
Man do not get me started on how incredibly messed up the whole shading language/OCL 2.0 situation is ;) "Automatic scalarization"? Please! Or we could just admit that the execution model of compute has some basic flaws and fix them ;)
I fully agree that the compute execution model and languages need an overhaul. But let's not hijack the thread :)

I would just prefer that the compiler understands to emit scalar instructions (and scalar loads) when I divide the thread id by 64 and then index a data array by that value. We do a lot of processing in 8x8 tiles (for example, this is our sub-tile size in the lighting compute shader). The scalar unit gives very nice performance boosts in these cases. If the compiler is not able to emit scalar instructions properly, we need manual control over it on PC.
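To illustrate the idea outside of shader code, here is a rough CPU-side analogy in plain C++ (not real shader code; the wave is modelled as a 64-iteration loop): the per-tile data depends only on the wave index, so it can be fetched once per wave, which is exactly the load a scalarizing compiler should move to the scalar unit.

```cpp
#include <cstdint>
#include <vector>

struct TileData { uint32_t lightCount; /* ...per 8x8 tile parameters... */ };

// One "wave" = 64 lanes. The tile lookup is wave invariant, so it is hoisted
// out of the per-lane loop (the equivalent of a single scalar load on GCN).
void ProcessWave(uint32_t waveId,
                 const std::vector<TileData>& tiles,
                 std::vector<float>& output)
{
    const TileData& tile = tiles[waveId];      // wave invariant: one load, not 64

    for (uint32_t lane = 0; lane < 64; ++lane) // per-lane (vector) work
    {
        uint32_t threadId = waveId * 64 + lane;
        output[threadId] = static_cast<float>(tile.lightCount) * 0.5f; // placeholder per-lane math
    }
}
```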
I don't see why you would ever do it that way... the notion that there are "vector vs. scalar registers and instructions" isn't a requirement, that's just how GCN works. I don't think what you're saying is even true on NVIDIA, and it definitely isn't true on Intel.
Yes, if your hardware has fast wave wide crossbars (for swizzling), you don't need any replication. Wave wide crossbars are more efficient for narrower SIMD. However, more connections always mean more wires and more power consumption. As I am just a coder, I don't know where the sweet spot lies.

The scalar unit (and scalar registers) bring more benefits than just reducing some vector register pressure (when storing resource descriptors). In general, it just seems like a big waste to perform operations that are not needed per lane on every lane (32x or 64x extra work). As the scalar unit itself is tiny, and the scalar register file is tiny, these can be placed much closer to each other. There are certainly big power consumption advantages in running wave invariant code on the scalar unit (and accessing the scalar registers) compared to indexing the vector registers and executing the same instruction on the wide SIMD (with the same data on all lanes).
In any case the static samplers stuff in DX12 should help GCN a bit and since it's something that I specifically got folks onboard for, I'm still waiting for my thank-you from AMD;)
I love static samplers. Thanks for pushing that forward :)

On Xbox 360 the microcode sampling instructions had static parameters for filtering and wrapping modes, etc (https://www.google.fi/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&cad=rja&uact=8&ved=0CD8QFjAEahUKEwiOp9a425HGAhVFhSwKHdz5AJU&url=http://www.davidbmoss.com/content/papers/Xbox%20360%E2%80%99s%20ATI%20Xenos%20GPU%20Shaders%20and%20Assembly%20(November%202008)%20-%20David%20Moss%20&%20Tom%20Lindeman.docx&ei=8MZ-VY7vJ8WKsgHc84OoCQ&usg=AFQjCNG3QHkibFCHQuSAUFC8VSBkpV8Mgw&sig2=zQna13mNNZt28zffNfpM9Q). This was super handy. You never had to define samplers on CPU side and set them to the pipeline. I have noticed that we never want to change a sampler from the CPU side. It is nice to finally see something similar on PC.
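The DX12 equivalent, for comparison, is a static sampler baked into the root signature, so nothing is ever set from the CPU side at draw time. A minimal sketch (root parameter setup omitted):

```cpp
#include <d3d12.h>

// A trilinear wrap sampler fixed at root-signature creation time (HLSL register s0).
D3D12_STATIC_SAMPLER_DESC MakeTrilinearWrapSampler()
{
    D3D12_STATIC_SAMPLER_DESC s = {};
    s.Filter           = D3D12_FILTER_MIN_MAG_MIP_LINEAR;
    s.AddressU         = D3D12_TEXTURE_ADDRESS_MODE_WRAP;
    s.AddressV         = D3D12_TEXTURE_ADDRESS_MODE_WRAP;
    s.AddressW         = D3D12_TEXTURE_ADDRESS_MODE_WRAP;
    s.MaxLOD           = D3D12_FLOAT32_MAX;
    s.ShaderRegister   = 0;
    s.ShaderVisibility = D3D12_SHADER_VISIBILITY_PIXEL;
    return s;
}

// When building the root signature:
//   D3D12_STATIC_SAMPLER_DESC samplers[] = { MakeTrilinearWrapSampler() };
//   D3D12_ROOT_SIGNATURE_DESC rsDesc = {};
//   rsDesc.NumStaticSamplers = 1;
//   rsDesc.pStaticSamplers   = samplers;
```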
Regarding caches it doesn't really matter ultimately whether the descriptor cache is on the texture sampler, the execution units or both. "More variety" and more cache misses favors neither architecture here really.
I was thinking about a shader that stores the resource descriptors in the instruction stream. This ensures that the resource descriptors never miss the cache (as the instruction stream is prefetched linearly and the resource descriptor and the sampling instruction share the same scalar/instruction cache line). This also handles long shaders with seldom taken control flow nicely. Indexing into memory doesn't handle this case well, since you have only two bad options: either store that seldom read resource descriptor in the same cache line(s) as the other, frequently used ones (partially unused cache lines loaded for the common case), or store it elsewhere (a full cache miss guaranteed when it is needed). Not a big thing, but something worth considering.
Anyways, I actually personally believe that GPUs are going to have to start to pay back some of the mortgages they have taken out on parallelism in the near future (I mean come on, it's pathetic that modern dGPUs need 4k to show real scaling), so IMHO it's not impossible that AMD needs to revisit whether SIMD64 really is the best trade-off. As I said, either design point is fine; I'm just unconvinced that one of the two is clearly better.
4K is a buzzword and so is virtual reality. Oculus and Morpheus need huge resolutions to look good. The current 1080p resolution looks like 480p (as the field of view is huge and the 1080p is split by two eyes). I would not personally consider buying a computer without a high DPI ("retina") display anymore.
Crysis1 sold over a million copies on PC.
One million unit sales for an AAA game is no longer enough. It is considered a big flop and would put any studio in a bad financial situation.
 
Crysis1 sold over a million copies on PC.

More recent examples are The Witcher 3, which has sold 4 million copies (as of last week), of which 1.3 million were PC sales, and ARK: Survival Evolved, which is currently the top-selling game on Steam (console debut next year).

Now that’s definitely good news for everyone, especially those who were claiming that the PC version of The Witcher 3 sold poorly (based solely on Steamspy’s numbers).
http://www.dsogaming.com/news/the-w...-copies-1-3-million-were-from-the-pc-version/
 
I fully agree that the compute execution model and languages need an overhaul. But let's not hijack the thread :)
Agreed, I just can't manage to drum up any sympathy for some of these issues given the languages ;)

On Xbox 360 the microcode sampling instructions had static parameters for filtering and wrapping modes, etc ... I have noticed that we never want to change a sampler from the CPU side. It is nice to finally see something similar on PC.
Cool! Yeah, a lot of the motivation was that for applications that were actually written for D3D's separate textures/samplers, we basically saw 5-10 samplers total. There just aren't that many useful combinations really... Of course some applications are still designed with 1:1 textures/samplers and the odd one does have to change some parameter at higher frequencies, but for the most part the parameters are highly redundant.

4K is a buzzword and so is virtual reality. Oculus and Morpheus need huge resolutions to look good. The current 1080p resolution looks like 480p (as the field of view is huge and the 1080p is split by two eyes).
VR currently uses tons of FLOPs but largely because it's very brute force. You don't actually need 16k per eye or whatever ridiculous resolution, you just need it in a very small location and the rest can be low res. Of course tracking that location with low enough latency is still largely an open problem but there's nothing fundamentally impossible about it and in the long run I don't think people are going to be happy with >3/4 of their GPU power being wasted in VR :)

In any case whether everyone moves to 4k or not is fairly irrelevant - at best it gives GPUs another few years before they need to take the problem seriously. But mark my words, you can't ignore IPC forever, even on GPUs. Good news is there's low hanging fruit there since it's something that they have largely ignored.

Anyways I'm straying a bit far off the original topic, but the discussion is interesting enough that it seems worth it :)
 
VR currently uses tons of FLOPs but largely because it's very brute force. You don't actually need 16k per eye or whatever ridiculous resolution, you just need it in a very small location and the rest can be low res. Of course tracking that location with low enough latency is still largely an open problem but there's nothing fundamentally impossible about it and in the long run I don't think people are going to be happy with >3/4 of their GPU power being wasted in VR :)

Some people seem to be working on that:

http://arstechnica.com/gaming/2015/05/vr-headset-company-fove-is-betting-on-eye-tracking-to-compete/
 
The thing is there is no "many high end PC gamers". Plus PC is niche in AAA gaming.

It depends what you consider "high end". According to Steam, for example, there are more PCs out there that are more powerful than the PS4 than there are XBOs.

Also, PC isn't really niche in AAA gaming any more. There are lots of examples of AAA games on the PC selling on par with or better than the XBO versions.
 