With the new rumors about Fiji's improved scalar unit (memory stores, full instruction set, one scalar unit per CU), the GCN architecture seems to be moving even closer to throughput-oriented in-order CPUs with wide SIMD.
I am aware of the scalar memory store capability, as that was added in Tonga and is part of the latest GCN ISA document. The other two items I have not seen mooted for Fiji, although we should know rather soon what tweaks, if any, it has over Tonga. A full instruction set I've only seen raised as a theoretical possibility, and one scalar unit per CU is already the case. Did you mean something like the data cache (maybe?) being replicated per CU?
Knights Landing has a simple in-order scalar pipeline that handles branching and control flow (and uniform integer math and uniform loads/stores can be offloaded to it), 512-bit AVX (16-wide for 32-bit float), and 4-way hyperthreading (GCN is 10-way). The similarities are striking.
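For concreteness on the "16-wide for 32-bit float" part, here is a minimal AVX-512 sketch (nothing KNL-specific, just plain AVX-512F intrinsics; needs a capable CPU and something like -mavx512f to build):

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    // A single 512-bit zmm register holds 16 single-precision floats.
    __m512 a = _mm512_set1_ps(1.0f);
    __m512 b = _mm512_set1_ps(2.0f);
    __m512 c = _mm512_add_ps(a, b);   // 16 FP32 adds in one instruction

    alignas(64) float out[16];
    _mm512_store_ps(out, c);          // store all 16 lanes
    printf("%zu lanes, lane 0 = %.1f\n", sizeof(out) / sizeof(out[0]), out[0]);
    return 0;
}
```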
From what I've seen, Knights Landing is a superscalar OoOE processor with speculative execution and memory pipelines, a far stronger memory model, and precise exceptions for at least the integer domain, and it meets the level of rigor that permits each core to function as a host processor.
I am curious whether Knights Landing's vector sections are listed separately due to some kind of separate scheduling or memory domain, which might provide something closer to a design parallel. The other design parallels, such as SIMD width and the presence of more than one hardware thread, do not seem much closer between the two than between other SMT and SIMD implementations.
A 256-bit (32-byte) resource descriptor is a significant amount of data. However, you also need to send 64 UVs: 64 * 2 * sizeof(float) = 512 bytes, so the resource descriptor is only 6.25% of the data. Some sampling instructions also need a mip level or gradients (further reducing the resource descriptor's share). It is not that bad a design call, and it gives AMD lots of flexibility in the future.
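Putting those numbers in one place (a quick C++ sketch of the arithmetic above, nothing more):

```cpp
#include <cstdio>

int main() {
    // One wavefront-wide sample request: a 256-bit descriptor alongside
    // per-lane UV coordinates (64 lanes, 2 floats each), as in the post above.
    const int descriptor_bytes = 256 / 8;              // 32 bytes
    const int uv_bytes = 64 * 2 * (int)sizeof(float);  // 512 bytes

    printf("descriptor: %d bytes, UVs: %d bytes\n", descriptor_bytes, uv_bytes);
    printf("descriptor relative to UV payload: %.2f%%\n",
           100.0 * descriptor_bytes / uv_bytes);        // 6.25%
    return 0;
}
```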
If I may weigh in, one possible factor, going back to 2011 and the early days of GCN, was that AMD was actively trying to reduce the amount of hidden state in the CU's execution context. For me, Google only brings up a few references to this, with one of my posts being one of them, oddly enough.
The texture path is an area where they had not fully exposed the CU's execution context to software, possibly related to the phase in the texturing process where texture accesses that require multiple samples are cracked into separate cache accesses and eventually returned as filtered values.
Explicit passing of descriptors to the vector memory path exposes what was once a separate collection of internal states. The GCN scalar unit itself might be more notable in that it is software-exposed, but there would have been a hardware analog running in the shadows before the clause model was abandoned.
In light of this goal, much less moves independently of the shader context for a compute architecture that had compute context-switching and pervasive virtualization as design targets. Pointer passing would mean there would be an engine doing who knows what if a context switch were ordered, whereas literal data passing doesn't need to worry about maintaining virtualization, since that was already handled by the explicit virtual memory system before the data made it to the scalar registers.
A TMU descriptor engine without the necessary synchronization or translation hardware would be potentially destructive, whilst having one that elaborate would be more expensive. One potential reason for the way things are is that the scalar unit might have been such an engine prior to being exposed in GCN.
For AMD's specific compute goals, passing the descriptor data itself may have been necessary for their implementation needs on the path towards the FSA-now-HSA model they wanted.
Programmatically generating resource descriptors may have been a consequence they noted, although Mantle's choice not to go down that route may point to a level of disinterest or to pitfalls in the technology.
It would be very different system behavior, which they may not have been able to validate, and which may have been too exotic compared to other architectures to get broad buy-in. There may be driver behaviors in Mantle that assume too much of the old model persists for shaders to be allowed to generate another layer of dynamic state that could interact with existing state.
Another factor, going back to what Knights Corner has that GCN does not, is that GCN has at least some FP exception tracking, but many other, non-vector faults are very imprecise. Basing Mantle 1.0 on programmatic descriptors built on hardware with a blind spot over its binding implementation may have been premature, or possibly too binding at a low level for maintaining compatibility or changing implementations.
Don't get fooled by the maximum number of compute queues (shown by some review sites).
I still don't get where the controversy is coming from on this. AMD's description of queues and the processors that manage them seems straightforward to me.
I was thinking about a shader that stores the resource descriptors in the instruction stream. This ensures that the resource descriptors never miss the cache (as the instruction stream is prefetched linearly, and the resource descriptor and the sampling instruction share the same scalar/instruction cache line).
The SALU's immediate handling might allow this, although the payload efficiency is not great and it might take some additional fiddling to prepare the destination registers.
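To put a rough number on that payload efficiency (a sketch; the four-byte-opcode-plus-four-byte-literal cost per scalar move is my reading of the GCN encoding sizes, so treat it as an assumption):

```cpp
#include <cstdio>

int main() {
    // Materializing a 256-bit (8-dword) descriptor from the instruction
    // stream with one scalar move plus a 32-bit literal per dword.
    const int descriptor_dwords = 8;
    const int bytes_per_literal_move = 4 + 4;  // assumed: opcode dword + literal dword

    const int payload_bytes     = descriptor_dwords * 4;                      // 32
    const int instruction_bytes = descriptor_dwords * bytes_per_literal_move; // 64

    printf("instruction bytes spent: %d\n", instruction_bytes);
    printf("descriptor bytes carried: %d\n", payload_bytes);
    printf("payload efficiency: %.0f%%\n",
           100.0 * payload_bytes / instruction_bytes);                        // 50%
    return 0;
}
```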
It may not work with programmatically generated values. The GCN ISA docs don't seem to offer a clear avenue for dynamically changing code.