DX11 vs DX12

Discussion in '3D Hardware, Software & Output Devices' started by iroboto, Jan 15, 2015.

  1. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,632
    Location:
    The North
    Lol I'm not sure that is fair.

    Wrt the draw call picture above: I'm sure it's more than just multiple draw calls happening there. They probably enabled some DX12-specific features to assist as well.
     
  2. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,887
    Likes Received:
    4,534
    Mission accomplished, with the exception that their major competitors (CPU & GPU) will now be able to experience performance advances similar to those that were previously only available on AMD hardware.

    I find this comment from the presentation link above interesting from the standpoint of who is not mentioned.

     
  3. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    Hey let's not pretend anyone else got it "right" until fairly recent history either (first would probably be PS3, but it's obviously a simpler problem on a fixed platform). :) And to be fair, if you go back more than a few years there were significant GPU hardware limits that prevented this level of implementation/performance.

    Still, I'm excited to see it all coming together :)
     
  4. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,632
    Location:
    The North
  5. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    There is some evidence that AMD drivers are more sensitive to CPU performance than Nvidia drivers. This is a problem for their APUs, which have low-performing CPUs.

    You can fix that either by making major changes to the existing driver (still a bit of a stop-gap that doesn't solve everything and has far less marketing potential), or by solving the problem once and for all, reaping the marketing benefits for almost two years, and ending up in a situation where, yes, the competitors also got faster, but the gap has narrowed because you started from a worse position.

    So even with Mantle dying slowly, it must have been a net positive. They came out looking like the innovative guys and some even bought their 'open' story.

    Yeah, that wasn't very subtle. :wink:
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I'd say there's more difference between AMD and NVidia GPUs than there is between these two APIs.

    I don't think Mantle has left beta, has it? Maybe it never will.

    Well, AMD has had years now to make Mantle work, it's climbed a lot of the curve already. GPU hardware always has elements that lead significant D3D iterations (witness: OpenGL extensions), so I don't doubt that both AMD and NVidia have had strong ideas about what's in D3D12.

    AMD still doesn't seem to have got to grips with shader compilation for GCN, so I'm hardly going to sing its praises. But it's not as if they're starting from zero.

    I agree: five APIs (D3D11, D3D12, Mantle, OpenGL, OpenCL) plus substantial support for the people writing open-source drivers for AMD GPUs is a bit of a spread. Whether Mantle, per se, is actually much of a dilution in such a wide range of software stacks could be argued.

    But yeah, I don't think Mantle will live for long. Unless D3D12 turns out broken, which is basically what happened with D3D10.
     
  7. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    To be fair, you could have made the same argument about CUDA when OpenCL was introduced, yet the former is still alive and well.
     
  8. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    Very likely. And there's a good chance that there will be considerable reuse between Mantle and DX12. But there's obviously going to be unique code as well.

    I don't think it has. Maybe that's on purpose, to avoid weaning too many people off Mantle in the future...
     
  9. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    OCL is Nvidia's ugly stepchild, but, yes: I may be completely wrong about all of this. :wink:
     
  10. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,462
    Location:
    Finland
    Mantle hasn't left beta, even though it was scheduled to do so last year.
     
  11. MJP

    MJP
    Regular

    Joined:
    Feb 21, 2007
    Messages:
    566
    Likes Received:
    187
    Location:
    Irvine, CA
    Actually, pretty much all of those features are supported on some existing GPU; it's just that none of them are exposed through D3D11. Intel GPUs already support PixelSync, Nvidia Maxwell GPUs support conservative rasterization, and I'm pretty sure that all recent GPUs can handle typed UAV loads. However, it's entirely possible that no current GPU supports all of the new functionality, so we might have to wait for that.
     
    Grall, liquidboy and iroboto like this.
  12. madyasiwi

    Newcomer

    Joined:
    Oct 7, 2008
    Messages:
    194
    Likes Received:
    32
    I am under the impression that Maxwell's ROV has a similar objective to Intel's PixelSync.

    Support burden aside, there is apparently enough of a difference that DX12 seems to perform fine across multiple vendors, whereas Mantle (games) need to be tailored to different SKUs from just one vendor to gain any benefit over DX11.

    Too low level?

    At least that was how it happened during the release of the R9 285: Anandtech - Mantle Teething Problems
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Still seems to be happening:

    http://www.hardocp.com/article/2015...x_960_gaming_video_card_review/5#.VMOoIulyY_w
     
  14. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Xbox 360 also had a low-level GPU API with fully manual resource management. PS2 also had one, but obviously without shaders (= quite a different problem set).
    I expected the scalar ALU to help the compiler a lot (especially as the GCN execution style makes the practical latency of most instructions 1, making it dead simple to schedule ALU instructions). GPR pressure seems to be a bigger problem compared to the previous VLIW architectures, since the GPU needs more simultaneous threads in flight to hide latency. The compiler wastes GPRs a lot. We discussed the hacks to make the compiler behave better in the other thread, but I firmly believe that the compiler should handle GPR allocation better. Sometimes I don't understand why it does some unbelievably stupid things.

    I would be very happy if we got an [isolate] attribute in HLSL. At least I could then manually solve the GPR allocation issues (have full control over register lifetime when needed). There are already HLSL attributes like [branch], [flatten] and [unroll] that let the developer solve branching issues on GPUs. Branches on GPUs need special care, but bad register allocation is often an even bigger performance problem (and hard to solve without perfect compilers or manual attributes like [isolate]).
    The platforms DX12 will support are not yet known. The biggest problem with DX10 adoption was the Vista requirement.

    Obviously, DX10 didn't bring that many useful new features either. The geometry shader was good in theory, but it is still slow even on the newest GPUs (especially on AMD). Stream out wasn't flexible enough, and single-pass techniques often beat it in performance. DrawAuto was basically useless (way too limited). Integer operations in shaders were nice to have, but again, even on current GPUs it's often still faster to emulate integer math with float math (if you don't need more than 24 bits). A float multiply-add can be used as a (single cycle) shift + insert. For example, my floating-point-based realtime DXT compressor beats the integer version. Also, integer operations are slower on consumer Nvidia hardware (AMD is quite good: only int mul is 4 cycles, most other ops are 1 cycle).
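The 24-bit point above can be illustrated on the CPU (a sketch using NumPy float32 as a stand-in for GPU floats; the packed-field example and its values are my own, not from the post):

```python
import numpy as np

# float32 has a 24-bit significand, so every integer up to 2**24 is exact.
a = np.float32(16777215.0)                       # 2**24 - 1, end of the exact range
assert a + np.float32(1.0) == np.float32(16777216.0)
# One step past 2**24, the odd integer is no longer representable:
assert np.float32(16777216.0) + np.float32(1.0) == np.float32(16777216.0)

# A multiply-add doubles as a single-operation "shift + insert":
# pack two 12-bit fields into one 24-bit value as hi * 2**12 + lo.
hi, lo = np.float32(2748.0), np.float32(2989.0)  # arbitrary example fields
packed = hi * np.float32(4096.0) + lo            # stays below 2**24, so exact
assert int(packed) == ((2748 << 12) | 2989)
```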

    DX11 fixed most of the DX10 issues: tessellation is much faster than geometry shaders (it is actually fast at simple things like particle quad creation), and compute shaders and append buffers provide both a performance and a flexibility boost over stream out. Indirect draw is DrawAuto done right, and you can set it up nicely with compute shaders. Integer operations are finally important, since you need them for compute shader address calculations (among other things). The bad binding model (which was supposed to be closer to hardware, but failed) is still there, but DX12 will finally solve that.
    Except that CUDA had good debugging tools and full C++ integration when OpenCL debuted. With OpenCL you had to write your shaders in a custom language (one that didn't support templates or any other modern language features). Shader code had to be placed in separate text files or embedded in the C++ code files as string literals (yuck!). There were no debuggers available. Try to develop a complex compute shader without single stepping and variable inspection... not at all productive.

    OpenCL 2.0 finally added work-group scan/reduce/vote and nested parallelism (device-side kernel enqueue). These are very important additions and will allow OpenCL 2.0 to catch up with CUDA in performance (CUDA has had similar lower-level functionality for a long time). OpenCL 2.0 still doesn't have a standard C++-compatible shader language (one that can be fully embedded in C++ code). There are third-party extensions for OpenCL that make it integrate better with C++, and SPIR will make it even easier in the future. Debuggers are also better now. A big downside, however, is that Nvidia doesn't have OpenCL 2.0 drivers (only Intel and AMD do), so you still need to write both CUDA and OpenCL 2.0 code paths when making consumer software.
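For reference, the semantics of the work-group scan operations mentioned above can be sketched in plain Python (the function names mirror OpenCL 2.0's built-ins for illustration only; this is not an actual OpenCL binding):

```python
from itertools import accumulate

# Inclusive scan: each work-item receives the sum of its own value
# and the values of all lower-id items in the work group.
def work_group_scan_inclusive_add(values):
    return list(accumulate(values))

# Exclusive scan: each work-item receives the sum of all lower-id
# items only (identity element for item 0).
def work_group_scan_exclusive_add(values):
    inclusive = list(accumulate(values))
    return [0] + inclusive[:-1]

vals = [3, 1, 4, 1, 5]
print(work_group_scan_inclusive_add(vals))  # [3, 4, 8, 9, 14]
print(work_group_scan_exclusive_add(vals))  # [0, 3, 4, 8, 9]
```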
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    NVidia has a new compute-dedicated GPU with substantially larger register files.

    It seems to me the easiest fix for AMD's compiler problems is larger register files.

    If necessary, revise the architecture so that the maximum number of threads is only 8 per ALU, by having 4 copies of the current register file per VALU, each of which can only support 2 hardware threads (there has to be some compromise for a much larger register file...). That way, a thread with a register allocation of 150 or 180 (as I often see) will still get 4 threads per VALU.
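The arithmetic above can be sketched as follows (a hypothetical model: the 256-VGPR budget and 10-wave cap per SIMD are commonly quoted GCN figures, and the bank layout is just the proposal restated in code):

```python
# Rough wave-occupancy model for a GCN-style SIMD (assumed figures:
# a 256-entry VGPR budget and a hardware cap of 10 waves per SIMD).
def waves_per_simd(vgprs_per_thread, register_file=256, max_waves=10):
    """Waves that fit when each thread needs vgprs_per_thread registers."""
    return min(max_waves, register_file // vgprs_per_thread)

# Hypothetical sketch of the 4-copies-of-the-register-file idea:
# four banks, each holding at most 2 waves from its own 256-register pool.
def waves_proposed(vgprs_per_thread, banks=4, per_bank_regs=256, per_bank_waves=2):
    return banks * min(per_bank_waves, per_bank_regs // vgprs_per_thread)

print(waves_per_simd(150))   # a 150-VGPR shader fits only 1 wave in this model
print(waves_proposed(150))   # the proposed layout would fit 4
```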

    Let me just say that the stupidity is more epic than that: using CodeXL to capture run-time OpenCL kernel compiles, I've discovered that compilation varies randomly (including VGPR allocation). I've been saying it for years now, but the JIT mentality that lies at the heart of this needs to be ditched.

    Another aspect of this is that the compiler just seems to give up when code complexity reaches a certain level. It seems as if it windows the code it can optimise, and stuff that falls across a boundary is just broken.

    I suppose MS could deploy [isolate] on XBox 360 because it owned compilation from source to binary. It doesn't on PC. [isolate] might just be a step too far in terms of cross-IHV relations.

    Or, come to think of it, actually impossible unless each IHV implements it directly.

    To be fair, though, SM4.0 was a badly needed revolution.

    Doesn't SM5 have int24 arithmetic? Or is it the lack of int fma24, specifically, that you're alluding to?
     
  16. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    That would be good. However, in addition, I still demand a good compiler that doesn't waste my GPRs.
    That sounds horrible... We often micro-optimize our shaders to meet certain GPR targets (exact numbers based on the GPU occupancy chart). One additional GPR often drops performance quite a lot (since it immediately reduces the occupancy of GPR-optimized shaders).
    Yes, GCN is very good at integer math. It supports both int24 mul and int24 fma (full rate), and also full-rate combined shift + mask (insert & extract bits). The same cannot be said about the old DX10 GPUs. And Nvidia still seems to cripple their consumer GPUs in integer performance (though not as hard as they cripple FP64).

    I do like integer math on GPUs nowadays (especially in compute shaders). Address calculation (indexing) with floating point and point sampling was hideous to get right (pixel centers were different in DX9 and DX10, so you had to write multiple code paths as well). In PC DX9 there was no sampling (or load) instruction that returned integers (or unnormalized floats); the results were always normalized to the [0,1] range, so there were always precision issues. With normalized numbers, the integer value N-1 is scaled to 1.0 (N being a power of two). This is always a lossy operation (as floats are base 2). Multiplying the value by N-1 and rounding to nearest fixes the issue, but it's messy. Point sampling with a +0.5 bias also works properly (but only if the pixel center is correct... I don't even remember which one was which).

    I must admit that DirectX 10 brought something very good: the load instruction (with integer index), and with it unnormalized integer return types for data loading. Unfortunately, neither DX10 nor DX11 brought unnormalized integer return types to the sample instruction (a perfectly valid use case when point sampling is used). Also, the DX10/DX11 load instruction does not support mips.
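The normalize/denormalize round trip described above can be checked numerically (a small Python/NumPy sketch standing in for shader code; N = 256 is just an example channel size):

```python
import numpy as np

N = 256  # example: an 8-bit channel, so stored integer i maps to i / (N - 1)
i = np.arange(N, dtype=np.int32)
normalized = i.astype(np.float32) / np.float32(N - 1)  # lossy: 1/255 is not exact in base 2

# Truncating after multiplying back by N-1 can drift down by one step:
naive = (normalized * (N - 1)).astype(np.int32)
print("values lost by truncation:", int((naive != i).sum()))

# Multiplying by N-1 and rounding to nearest recovers every value, as described:
fixed = np.rint(normalized * (N - 1)).astype(np.int32)
assert (fixed == i).all()
```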
     
  17. Max McMullen

    Newcomer

    Joined:
    Apr 4, 2014
    Messages:
    20
    Likes Received:
    106
    Location:
    Seattle, WA
    Catching up on Beyond3D posts.... load supports mips and array indices:

    https://msdn.microsoft.com/en-us/library/windows/desktop/bb509694(v=vs.85).aspx

    Unnormalized integer return types from sample were not implemented because hardware vendors couldn't guarantee the LSBs passed unmodified through the sampler even when running point sampling. This may have been improved in the time between the D3D10 hardware spec and now; my team will revisit this with hardware vendors.

    Thanks!

    Max McMullen
    Direct3D Development Lead
    Microsoft
     
    Grall, mosen, liquidboy and 1 other person like this.
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    My mind boggles at the idea that texturing hardware can't pass through 32-bit data unmolested. What is this stupid hardware? Some nasty Intel stuff from way back when? What the hell was Microsoft doing pandering to shit like that?
     
  19. homerdog

    homerdog donator of the year
    Legend Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,294
    Likes Received:
    1,075
    Location:
    still camping with a mauler
    I have to think that modern hardware (even Intel) is able to do that...
     
  20. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,629
    Likes Received:
    1,227
    Location:
    British Columbia, Canada
    Yeah, I don't think this would have been a problem on any Intel DX10+ systems, but I didn't check specifically.

    It's not actually that crazy though, Jawed... before UAVs and general-purpose memory paths you only had texturing and ROP hardware. Both of those have "blend functions" attached to the end of them, with weights and such. Remember that integers are still a relatively recent thing for GPUs, and DX10 is pretty old now...

    But yes, obviously all modern chips should have general purpose memory data-paths with regular addresses and no "filtering/blending". Now if only we didn't have to statically allocate register file space... but that's another discussion ;)
     