I don't think we should heap praise on a platform for merely catching up after holding back PC gaming for many years.
Consoles have been leading the PC in GPU feature set for almost 1.5 years now, and are still leading today. PC GPUs have had some advanced features, but there hasn't been any API support for those features. DirectX 12 will allow the PC to catch up, but today it is still in a technical preview state.
According to the latest Steam Survey, the number of 12.1-capable GPUs is almost non-existent right now. Hopefully 12.1 support will reach 1%-2% by Christmas. Still, it would be a bit too early to say that consoles are holding the PC back, especially since the consoles have some unique GPU features that are not even exposed in DirectX 12.1. Hopefully DirectX 12.2 will expose some of them.
You should do it as a side project, sebbbi! Get NV to give you some funding - I'm sure they would for a showcase game that's exclusive to their latest GPU line. It wouldn't have to be some crazy-budget open-world game or FPS; a platformer or puzzle game would do fine, I'm sure, as long as it blows everything else away graphically. You know just about everyone with a Maxwell 2 GPU would buy it.
Volume tiled resources are great for distance fields (because the texture filtering works properly for sparse distance fields). I have a little nitpick, however. I dislike the tile shape chosen for DXT-compressed textures. BC4 is the best format for distance fields, and it has 256x128x32 texel tiles (64x32 DXT blocks with 32 depth slices). I would have preferred 32x32 DXT blocks with 64 depth slices = 128x128x64 texels. That more cube-like shape would have resulted in better locality AND lower memory consumption for sparse distance fields, since only tiles that include surfaces need to be stored at the highest mip, and a surface can cut through a tile at any orientation.
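To make the sparse-distance-field point concrete, here is a minimal HLSL sketch (the resource names and the fallback mip level are my own assumptions, not anything from the feature itself) of sampling a volume tiled resource with residency feedback: filtering works as usual, and the shader can detect when it hit a non-resident tile and fall back to a coarser, fully resident mip.

```hlsl
// Hedged sketch: sampling a sparse BC4 distance field stored as a volume
// tiled resource. Resource names and the fallback mip are assumptions.
Texture3D<float> DistanceField  : register(t0);
SamplerState     TrilinearClamp : register(s0);

float SampleSignedDistance(float3 uvw)
{
    uint status;

    // Sample the highest mip; only tiles that actually contain surfaces
    // are mapped there.
    float d = DistanceField.SampleLevel(TrilinearClamp, uvw, 0.0f,
                                        int3(0, 0, 0), status);

    if (!CheckAccessFullyMapped(status))
    {
        // The tile is not resident: fall back to a coarser mip that is kept
        // fully resident (mip 3 is an arbitrary choice for this sketch).
        d = DistanceField.SampleLevel(TrilinearClamp, uvw, 3.0f);
    }
    return d;
}
```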
This is not a benefit that is directly related to wide SIMD. For instance, Intel can issue instructions of different SIMD widths (including scalar) that index the register file in fairly arbitrary ways; there's never a need to store scalars as N-wide "vector registers".
Oh yes, you can swizzle and broadcast (splat) values to/from a single SIMD lane to reduce the register pressure of wave-invariant data. I don't know the low-level details of the Intel (GPU) and NVIDIA instruction sets well enough to determine how fast and efficient the swizzles / register indexing are. It seems that AMD has been quite successful in separating the scalar processing from the vector processing, and this has also reduced the need for arbitrary swizzles / register indexing by the vector units. AMD has been able to split the 64-wide SIMD into four independent 16-wide units, and split the 256 KB register file into four 64 KB register files (one per 16-lane SIMD). This separation should allow the smaller register files to be stored much closer to the execution units, reducing latency and power consumption.
The most common swizzle in GPU pixel shader code is the quad swizzle (it exchanges data between the four consecutive threads of a quad and is used to calculate the derivatives for mip mapping and anisotropic filtering in pixel shaders). This swizzle never crosses a 16-lane boundary. Slide 42 in this presentation (http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah) describes a full crossbar for this particular operation. There are also three different 32-wide swizzles (slide 43). The document doesn't describe the latency of these swizzles (which cross the 16-lane boundary). Swizzles like this are handy for some compute shader constructs such as vote, broadcast and prefix sum. However, these constructs are not time-critical for most shaders (even for those shaders that need them).
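For reference, a small HLSL sketch of the constructs being discussed, expressed with the Shader Model 6.0 wave/quad intrinsic syntax (not something available in the HLSL we have today, so treat it purely as an illustration of the constructs, not of any particular hardware path):

```hlsl
// Illustration only: cross-lane constructs written with SM 6.0 intrinsics.
RWStructuredBuffer<float4> Output : register(u0);

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    float v = (float)dtid.x;                    // some per-lane value

    // Quad swizzle: the same neighbour exchange pixel shaders use for
    // derivative (mip/aniso) calculations; never crosses a 16-lane boundary.
    float ddxApprox = QuadReadAcrossX(v) - v;

    // Wave-wide constructs mentioned above: vote, broadcast, prefix sum.
    bool  anyBig    = WaveActiveAnyTrue(v > 32.0f);
    float broadcast = WaveReadLaneFirst(v);     // value from the first active lane
    float prefix    = WavePrefixSum(v);         // exclusive prefix sum across the wave

    Output[dtid.x] = float4(ddxApprox, anyBig ? 1.0f : 0.0f, broadcast, prefix);
}
```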
Man, do not get me started on how incredibly messed up the whole shading language / OCL 2.0 situation is.
"Automatic scalarization"? Please! Or we could just admit that the execution model of compute has some basic flaws and fix them.
I fully agree that the compute execution model and the languages need an overhaul. But let's not hijack the thread.
I would just prefer that the compiler knows to emit scalar instructions (and scalar loads) when I divide the thread id by 64 and then index a data array with that value. We do a lot of processing in 8x8 tiles (for example, this is our sub-tile size in our lighting compute shader). The scalar unit gives very nice performance boosts in these cases. If the compiler is not able to emit scalar instructions reliably, we need manual control over it on PC.
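A minimal sketch of that pattern, assuming a made-up per-tile buffer layout (the real shader obviously does much more): every lane of the 64-wide wave computes the same index, so the ideal GCN code is a scalar load into scalar registers instead of a replicated per-lane vector load.

```hlsl
// Sketch only: structure layout and names are made up.
struct TileInfo
{
    uint lightCount;
    uint firstLightIndex;
};

StructuredBuffer<TileInfo> Tiles  : register(t0);
RWStructuredBuffer<float>  Result : register(u0);

[numthreads(64, 1, 1)]
void TiledLightingCS(uint3 dtid : SV_DispatchThreadID)
{
    // Thread id divided by 64 = identical for all 64 lanes of a GCN wave
    // (one 8x8 sub-tile), i.e. wave-invariant -> a scalar load candidate.
    uint tileIndex = dtid.x / 64;
    TileInfo tile = Tiles[tileIndex];

    // ... the per-pixel light loop would use tile.lightCount /
    // tile.firstLightIndex here ...
    Result[dtid.x] = (float)tile.lightCount;
}
```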
I don't see why you would ever do it that way... the notion that there are "vector vs. scalar registers and instructions" isn't a requirement, that's just how GCN works. I don't think what you're saying is even true on NVIDIA, and it definitely isn't true on Intel.
Yes, if your hardware has fast wave-wide crossbars (for swizzling), you don't need any replication. Wave-wide crossbars are more efficient for narrower SIMD. However, more connections always add more wires and more power consumption. As I am just a coder, I don't know where the sweet spot lies.
The scalar unit (and the scalar registers) brings more benefits than just reducing some vector register pressure (when storing resource descriptors). In general, it just seems like a big waste to perform operations that don't need to vary per lane on every lane (32x or 64x extra work). As the scalar unit itself is tiny, and the scalar register file is tiny, these can be placed much closer to each other. There are certainly big advantages in power consumption in running wave-invariant code on the scalar unit (and accessing the scalar registers) compared to indexing the vector registers and executing the same instruction on the wide SIMD (with the same data on all lanes).
In any case, the static samplers stuff in DX12 should help GCN a bit, and since it's something that I specifically got folks on board for, I'm still waiting for my thank-you from AMD.
I love static samplers. Thanks for pushing that forward.
On Xbox 360, the microcode sampling instructions had static parameters for the filtering and wrapping modes, etc. (https://www.google.fi/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&cad=rja&uact=8&ved=0CD8QFjAEahUKEwiOp9a425HGAhVFhSwKHdz5AJU&url=http://www.davidbmoss.com/content/papers/Xbox%20360%E2%80%99s%20ATI%20Xenos%20GPU%20Shaders%20and%20Assembly%20(November%202008)%20-%20David%20Moss%20&%20Tom%20Lindeman.docx&ei=8MZ-VY7vJ8WKsgHc84OoCQ&usg=AFQjCNG3QHkibFCHQuSAUFC8VSBkpV8Mgw&sig2=zQna13mNNZt28zffNfpM9Q). This was super handy. You never had to define samplers on the CPU side and bind them to the pipeline. I have noticed that we never want to change a sampler from the CPU side. It is nice to finally see something similar on PC.
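For comparison, this is roughly what it looks like in DX12 HLSL: a static sampler declared directly in the root signature (the root signature layout here is just a minimal example I made up), so the sampler state is baked into the PSO and nothing is ever created or bound from the CPU side.

```hlsl
// Minimal example of a static sampler in a D3D12 HLSL root signature.
#define ExampleRS \
    "RootFlags(ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT), " \
    "DescriptorTable(SRV(t0)), " \
    "StaticSampler(s0, " \
        "filter = FILTER_MIN_MAG_MIP_LINEAR, " \
        "addressU = TEXTURE_ADDRESS_WRAP, " \
        "addressV = TEXTURE_ADDRESS_WRAP, " \
        "addressW = TEXTURE_ADDRESS_WRAP)"

Texture2D<float4> Albedo     : register(t0);
SamplerState      LinearWrap : register(s0);   // resolves to the static sampler

[RootSignature(ExampleRS)]
float4 PSMain(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    return Albedo.Sample(LinearWrap, uv);
}
```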
Regarding caches, it doesn't really matter ultimately whether the descriptor cache is in the texture sampler, the execution units, or both. "More variety" and more cache misses favor neither architecture here, really.
I was thinking about a shader that stores the resource descriptors in the instruction stream. This ensures that the resource descriptors never miss the cache (as the instruction stream is prefetched linearly, and the resource descriptor and the sampling instruction share the same scalar/instruction cache line). This also handles long shaders with seldom-taken control flow nicely. Indexing into memory doesn't handle this case well, since you have only two bad options: either store that seldom-read resource descriptor in the same cache line(s) as the other, frequently used ones (partially unused cache lines loaded in the common case), or store it elsewhere (a guaranteed full cache miss when it is needed). Not a big thing, but something worth considering.
Anyways, I actually personally believe that GPUs are going to have to start paying back some of the mortgages they have taken out on parallelism in the near future (I mean come on, it's pathetic that modern dGPUs need 4k to show real scaling), so IMHO it's not impossible that AMD will need to revisit whether SIMD64 really is the best trade-off. As I said, either design point is fine; I'm just unconvinced that one of the two is clearly better.
4K is a buzzword and so is virtual reality. Oculus and Morpheus need huge resolutions to look good. The current 1080p resolution looks like 480p (as the field of view is huge and the 1080p is split between two eyes). I would not personally consider buying a computer without a high-DPI ("retina") display anymore.
Crysis 1 sold over a million copies on PC.
One million unit sales is no longer enough for an AAA game. One million sales is considered a big flop and would put any studio in a bad financial situation.