DirectX Developer Day 2020

Discussion in 'Rendering Technology and APIs' started by Kaotik, Mar 19, 2020.

  1. JoeJ

    Veteran

    Joined:
    Apr 1, 2018
    Messages:
    1,523
    Likes Received:
    1,772
    No. That's not what I mean. My rant is not directed specifically at MS, NV, AMD, Khronos, or the game industry... probably I expressed myself a bit unfortunately.
    I want both NV and AMD to offer extensions to lift restrictions on compute. I'm fine if compatible hardware is rare, because it will take ages until my work finds its way into a product.
    I would even be fine with vendor APIs, replacing the need for Khronos and MS to find compromise solutions after lots of time.
    And I see myself as an outsider, but I also feel left behind, because my compute work is different from what we see in current games.

    I assumed so. VK has added conditional draw. It would be little work for AMD to extend this to support barriers.
    The same goes for NV's device generated commands extension for VK. I talked with its developer here on the forum, and he considered adding this. I have not checked whether they have already done an update, but being heard is good and all I want.
    I know it will take time - lots of it - so the earlier we request things the better.

    You may be right, which is why I mentioned the brute force mentality I see in games.
    Personally I could put those things to good use (probably). And I look a bit enviously at offerings e.g. in OpenCL 2.0 (which also works on NV), while I assume this solution is not hardware friendly enough for games.
    (In my case the workloads in question are not massive - I would accept some performance cost for the gained flexibility, which might result in a speedup in the end.)
    There certainly is some risk in assuming 'games don't need this'. Games might find good use for new functionality after it becomes available.

    Ok. That said, let me apologize for my offended reaction above. It's another case of heat for no reason, I guess. ;)
     
    pharma and DavidGraham like this.
  2. Lurkmass

    Regular

    Joined:
    Mar 3, 2020
    Messages:
    565
    Likes Received:
    711
    It stopped serving that purpose when it wasn't able to hit "acceptable levels of performance" on other vendors. "Performance portability claims" need to be evaluated on a constant basis and sometimes they turn out to be false in the end.

    I am well aware of this, just as much as you are, and ultimately, as with any "black box", the drivers behind PC APIs can 'pretend' to give the results the user wants underneath it all.

    I realize that a function call can get 'inlined' by the driver's shader compiler so that there's no overhead for the function call, but from what I recall this isn't true for Nvidia hardware from Volta onwards, based on what I heard from one of their engineers. It sounds like they actually use independent thread scheduling to keep track of the different function calls, with a program counter per lane.

    Also, why depend on the compiler to get it right when I can be 'explicit' about it with the newer API? If I can do it myself instead of having the driver do it automatically for me, then why not take advantage of that fact? It's just more points in favour of DXR Tier 1.1 if this optimization can be performed and applied more often in the manual case...

    We must have very different experiences then, because OpenGL (even in modern versions) is, in its entirety, an objectively bad API abstraction on most modern hardware today. The exception is Nvidia hardware, which has traditionally been modeled as a global state machine for a very long time; they implement Vulkan on top of OpenGL in their drivers, whereas it would be the other way around for most vendors.

    Oftentimes, if a black box 'outlives' its necessity, it rightfully gets 'discarded'...
     
  3. willardjuice

    willardjuice super willyjuice
    Moderator Veteran Alpha

    Joined:
    May 14, 2005
    Messages:
    1,386
    Likes Received:
    299
    Location:
    NY
    Hey, I want callable shaders just as much as you :wink:, but remember graphics hardware is like everything else in life: a zero-sum game. To increase flexibility in one area, we must lose some efficiency [relative performance] in another. IHVs are very sensitive to this (remember: bar graphs of today's games sell GPUs, not the promises of tomorrow). I don't doubt callable shaders are the future, but I also think you might be underestimating the "cost" to make them fully generic (and performant).

    Before you look down on the uber shader strategy, think about the practical benefits of focusing on making them faster today in hardware. What does that gain us? Well, for starters it lets us "emulate" callable shaders for the first time in controlled environments (ray tracing). But just as importantly, it can also bring performance benefits to current games that leverage large uber shaders. Something that helps us step forward, but also increases (some) bar graphs. Was this exactly what you wanted? Clearly it was not. Can we agree, though, that DXR was a huge step forward? These "bridge solutions"/black boxes/evolutions should not be treated with disdain, but celebrated! You'll get your callable shaders one day, but it may take a few evolutions to get there. :p Slowly the restrictions will melt away...

    To tie this back to ray tracing, all I'm saying is: be careful what you wish for. I don't know about you guys, but I'm not exactly blown away by ray tracing performance at the moment. How about, before we really expand the flexibility of DXR, we make what we currently have (which I think can look great!) faster. :mrgreen:
     
    OlegSH, DavidGraham, BRiT and 2 others like this.
  4. willardjuice

    willardjuice super willyjuice
    Moderator Veteran Alpha

    Joined:
    May 14, 2005
    Messages:
    1,386
    Likes Received:
    299
    Location:
    NY
    I mean... it's hard to argue against something that you pulled out of thin air. Can you cite specific examples of the DXR 1.0 API that you feel are not portable performance-wise? I don't find that to be DXR 1.0's limitation, but rather its awkwardness in integrating with existing engines.

    This is why information can be dangerous. Yes, Nvidia now includes an extra register per vector lane to keep track of program counters. No, this doesn't make things magically fast. It makes things less slow (you still lose all efficiency during divergence; it just makes scheduling significantly easier during divergence within a warp... diverging is still SLOW with a capital S).

    Why is DX11 sometimes faster than DX12? Because the driver can be smarter than the programmer! The programmer has the advantage of understanding the workload better, but the driver has the advantage of understanding the hardware better. Usually understanding the workload provides better advantages, but not always! Let's pretend BF5 RT was done last year with "DXR 1.1". How could they possibly know if they scheduled their rays optimally on RDNA2? The hardware didn't exist when they added DXR! Perhaps they accidentally hit a "slow path" on RDNA2 that maybe the driver wouldn't have...

    Yeah OpenGL sucks, what do you want from me? Let me ask you a question, do you think DX11 is "bad"?
     
  5. Lurkmass

    Regular

    Joined:
    Mar 3, 2020
    Messages:
    565
    Likes Received:
    711
    Of course you still pay the cost of executing a divergent path during shading, but you can't deny that independent thread scheduling cuts the overhead of doing multiple function calls.

    It's hard to predict where hardware design will be headed in the distant future, so I wouldn't be surprised if the example you brought up doesn't hit the expected performance; DXR 1.1, on the other hand, is here now to define the present and the hardware coming in the near future. I get your point that real-world profiling wouldn't hurt.

    As far as D3D11 is concerned, its days are pretty much numbered. The fxc compiler and DXBC have been deprecated, so it's only a matter of time before bit rot sets in and stinks up the whole API. D3D11 will be maintained for its legacy software support, not because developers will be expected to write high-performance graphics code with it in the future.

    Eventually D3D11 will just reach the same rotten state that OpenGL is in, so why bother considering outdated tools that will become obsolete? Do you think using unmaintained APIs is somehow less of a liability for releasing your new project compared to using a more explicit API?

    Just on principle alone, DXR 1.1 should be preferred over 1.0, as it's more cross-vendor friendly and a newer API that will be better maintained in comparison.
     
  6. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    I just assumed it's because not all developers are capable of coding at a lower level. That's sort of the reason, at the very least, to keep it around.

    Without DX11, the graphics field would shrink to an increasingly small pool of developers, without a lot of young blood coming in.
     
  7. willardjuice

    willardjuice super willyjuice
    Moderator Veteran Alpha

    Joined:
    May 14, 2005
    Messages:
    1,386
    Likes Received:
    299
    Location:
    NY
    I don't think this works the way you think it works. The feature you're describing for Turing was more about making scheduling possible in various circumstances, not "reducing the cost of multiple function calls". In compute shaders, we have "threads", right? Well, they aren't actually "threads" in modern GPU hardware; they are often lanes within a simd/vector unit. Do you know what happens today (pre-Volta/Turing) if you write a compute shader where thread N waits (e.g. on an atomic lock) on thread N-1, where both N and N-1 are part of the same simd/vector unit? It'll deadlock (in hardware the entire vector unit shares the same program counter, so threads within that unit can't advance independently if there are inter-dependencies). It's perfectly valid HLSL code, but it'll still deadlock. And if you send this perfectly valid code to Intel/AMD/Nvidia and claim "BUG, IT'S BROKEN!", do you know what they'll say? "Yeah, it's broken and we're not fixing it." The heroics IHVs would need to go through to get that kind of complex scheduling working within a simd/vector unit are too great. In practice, no one supports that functionality correctly even though "technically" they should! Another example of a flexibility vs. performance trade-off!
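    The intra-warp deadlock described above can be sketched with a toy Python simulation (an illustrative model only, not real shader or driver code; the names `lane`, `lockstep` and `independent` are invented for the sketch). Each "lane" spins until the previous lane publishes a flag: a shared-PC scheduler that must finish one divergent side before unmasking the other never runs the lane that would release the waiters, while per-lane program counters let every lane make forward progress.

```python
def lane(i, flags):
    """One SIMD lane as a generator; every yield is one instruction."""
    if i > 0:
        while not flags[i - 1]:   # spin until the previous lane publishes
            yield
    flags[i] = True               # publish our own flag
    yield

def lockstep(n, max_steps=1000):
    """Shared-PC model (pre-Volta): the warp finishes one divergent side
    before unmasking the other. The spinning side runs first, so lane 0,
    which would release everyone, never executes -> livelock."""
    flags = [False] * n
    spinners = [lane(i, flags) for i in range(1, n)]
    for _ in range(max_steps):
        for g in spinners:
            next(g)               # each spinner just spins again
    return all(flags)             # False: no progress was ever made

def independent(n, max_steps=1000):
    """Per-lane-PC model (Volta and later): fair round-robin progress."""
    flags = [False] * n
    live = [lane(i, flags) for i in range(n)]
    for _ in range(max_steps):
        survivors = []
        for g in live:
            try:
                next(g)
                survivors.append(g)
            except StopIteration:
                pass
        live = survivors
        if not live:
            return all(flags)     # every lane retired
    return False
```

    In this model `lockstep(4)` never completes (returns False) while `independent(4)` finishes in two passes, mirroring the "won't fix" behavior described in the post.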

    That's all this feature does. It allows (albeit REALLY SLOWLY) the possibility of complex scheduling within a vector unit. Nothing more! If you think this feature was created for callable shaders, I'm sorry to be blunt, but you're simply mistaken. It just makes scheduling within a vector unit feasible (not necessarily fast or desirable)!

    Stop digging. :-D I think it's been established by now that there are many use cases where 1.0 is actually beneficial over 1.1. But really, you're comparing two APIs that in all likelihood will be used together!!! It's almost like saying "well, now that we have compute shaders, who'd ever use these silly limiting things called pixel shaders???". DXR 1.1 was never designed with the intention of making DXR 1.0 obsolete. I don't understand why you continue to double down on this notion that "DXR 1.0 has outlived its purpose". That will be true one day, but that day is not DXR 1.1. They are meant to be complementary!!!

    Finally, the correct answer was that DX11 might be the GOAT of graphics APIs (DX9 was pretty good for its time too). Was it perfect? Of course not! Isn't it telling, though, that five years after DX12 came out we still have AAA games that run better on DX11? :cool: (I should also point out that the old fxc compiler tends to still be better than DXIL in practice :mrgreen:) Beating well designed black boxes is harder than you think. After what we have learned from DX12, are you so sure that developers will always schedule rays more optimally than the driver? I certainly wouldn't bet the farm on it! :p
     
    pharma, DegustatoR, OlegSH and 4 others like this.
  8. Lurkmass

    Regular

    Joined:
    Mar 3, 2020
    Messages:
    565
    Likes Received:
    711
    Then consider moving to WebGPU as a less explicit option once it releases. Better that than planning to stay on a deprecated tool that will accumulate bugs for years to come. Microsoft made it really clear that they will not keep maintaining the fxc compiler.
     
  9. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,426
    Likes Received:
    909
    DX12 fares very well on AMD GPUs the majority of the time. It's only Nvidia GPUs that always have issues.
     
  10. Lurkmass

    Regular

    Joined:
    Mar 3, 2020
    Messages:
    565
    Likes Received:
    711
    Of course I haven't forgotten about the main goal of independent thread scheduling, and I'm also well aware of the issue that a per-thread mutex on GPUs can cause deadlocks, but didn't Microsoft declare that to be undefined behaviour? I wasn't making the point that independent thread scheduling was made specifically for callable shaders, but that it is convenient to use for that case.

    I also knew that callable shaders could be done before independent thread scheduling, but on AMD that involved the scalar unit and register file loading and executing multiple shader programs, which was ugly.

    For one IHV, sure, in the past it may clearly have been the case that D3D11 exhibited better performance, but for the other IHV the new API was an immediate improvement, since it abstracted their hardware better by introducing concepts such as explicit barriers, multiple engines/queues, and the new binding model.

    For some IHVs it might be better to have the black box, since they obviously want to keep it badly just so that they can potentially introduce some more shader replacement hacks to win in more benchmarks. The other IHV just wants to converge on their hardware design, to keep this black box 'maintenance' to a minimum, so they'd rather educate developers about how their hardware works.

    How "well designed" this black box is depends on the IHV. I guess we'll agree to disagree, since we won't reach a meaningful consensus.
     
    milk and techuse like this.
  11. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    That's not true; there is an entire thread dedicated to games performing worse in DX12 than DX11 on all GPUs.
     
  12. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,426
    Likes Received:
    909
    Do you agree that less than half of DX12 games perform worse than DX11 on AMD GPUs?
     
  13. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    That's a gross oversimplification; many developers don't share this sentiment. It was mostly a pain in the ass for a lot of developers.

    Stardock (the developer of the first ever DX12 game, Ashes of the Singularity) recently posted a blog post explaining why they abandoned DX12/Vulkan in their newest game, Star Control, in favor of DX11.

    It basically boils down to the extra effort it takes to develop the DX12/Vulkan path: longer loading times and VRAM crashes are common in DX12/Vulkan if you don't manage everything by hand. The performance uplift is also highly dependent on the type of game, and they only achieved a 20% uplift on the DX12 path of their latest game, which they deemed not worth all the hassle of QA and bug testing.

    In the end, they advise developers to pick DX12/Vulkan based on features, not performance: ray tracing, AI, utilizing 8 CPU cores or more, compute-shader based physics, etc.

    https://www.gamasutra.com/blogs/Bra...k1u3fwoIL_QUdwlVhagyopZhQre_YywJ1qxtRiH8Rn0Zo

     
    pharma and OlegSH like this.
  14. willardjuice

    willardjuice super willyjuice
    Moderator Veteran Alpha

    Joined:
    May 14, 2005
    Messages:
    1,386
    Likes Received:
    299
    Location:
    NY
    It's perfectly legal behavior. Do you know how to solve this problem on today's hardware? It has nothing to do with GCN's scalar unit. You solve this problem today by forcing each "thread" in a compute shader to run on the entire simd/vector unit (i.e. treat the vector unit as a scalar unit). Then every compute shader "thread" gets its own hardware program counter. Of course this cuts performance to 1/16 (on Intel and GCN) or 1/32 (on Nvidia and RDNA). In practice this would make your shader useless (for real-time rendering at least). In addition, I suspect it's nearly impossible to automatically detect when you'd need to enable this "feature" when compiling a shader, so literally all your shaders would have to take a huge drop in performance. Do you see why it's a "won't fix" resolution by AMD/Nvidia/Intel? :-D

    Now what does Volta/Turing do for us? Well, it basically lets us automatically enable the "make my shader really slow" feature. That's it! If you use this feature on Turing and schedule each "compute shader thread" as if it's a "real thread", you'll still get the same 1/32 performance. You brushed off the divergence issue in your previous post, but uh... that's still the main problem! So yes, now Nvidia can automatically enable that feature for you without making everyone else's shaders slow. But here's the thing: if you have huge divergence, your shader is still not going to be practical! That's the big problem with callable shaders (and ray tracing for that matter)!!! It's not the scheduling, it's that they often diverge!!! GPUs are REALLY BAD at divergence (even when scheduled well!).
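    The claim that divergence stays slow even with perfect per-lane scheduling can be put into rough numbers with a toy cost model (a back-of-envelope sketch, not vendor data; the `warp_cost` function and the 32-lane width are assumptions of the model): a SIMD unit serializes over the *distinct* divergent paths in a warp, so no amount of clever scheduling recovers the masked-off lanes.

```python
def warp_cost(lanes):
    """lanes: list of (branch_id, instruction_count), one entry per lane.
    A SIMD unit serializes over distinct branches: total cost is the sum,
    over each branch taken by anyone in the warp, of its longest run."""
    per_branch = {}
    for branch, n in lanes:
        per_branch[branch] = max(per_branch.get(branch, 0), n)
    return sum(per_branch.values())

uniform   = [(0, 10)] * 32                 # all 32 lanes agree on one path
divergent = [(i, 10) for i in range(32)]   # every lane on its own path

print(warp_cost(uniform))    # 10: one pass over the shared path
print(warp_cost(divergent))  # 320: 32x slower, the 1/32 utilization above
```

    Per-lane program counters change *which* schedules are possible, not this arithmetic: the fully divergent warp still pays for every path.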

    I don't understand why you are so hyped about an extra register per vector lane. Do you think all that separated Pascal from making callable shaders fast was an extra register per lane? I'm not even sure this feature is utilized/exposed on the graphics side.

    Two questions: 1) Do you think AMD's DX12 path would be faster than AMD's DX11 path as often without async compute? 2) Do you think async compute could have been added to a high level language like DX11?

    Answer Key: 1) No 2) Yes

    Bonus Answer Key: The (lack of a) black box has nothing to do with AMD's "success" on DX12

    Second Bonus Answer Key: You know what's the best part of DX12 and the removal of black boxes? Robustness. DX12 is significantly less error/bug prone than DX11 in practice, due to it being a "clearer" API. That's the point you should be hammering away at. Forget performance; DX12's best selling point is that it can save you a lot of (debugging) time!

    Third Bonus Answer Key: None of this disproves my original point. Even if we were to blindly accept that poor ole AMD was held back by DX11's big bad black box, there are still cases today where AMD's DX12 is slower. No matter how you slice it, there's ample evidence that drivers can beat developers. And it's not even a rare event! If we accept this truth (which I don't really view as controversial), then we can accept that there's a time and place for both DXR 1.0 and 1.1.
     
  15. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,057
    Likes Received:
    3,114
    Location:
    New York
    This is what Microsoft actually says about inline RT uber shaders. They seem to think black boxes are still quite useful.

    https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html#inline-raytracing

    I imagine many PC devs will happily delegate to the driver instead of hand optimizing for every permutation of RT hardware in the market. Inline tracing is a useful tool but it's certainly not the best tool for every job.

    This assumes of course that you have a proper DXR 1.0 driver that can handle the optimization.
     
    #75 trinibwoy, Mar 24, 2020
    Last edited: Mar 24, 2020
    pharma and DegustatoR like this.
  16. JoeJ

    Veteran

    Joined:
    Apr 1, 2018
    Messages:
    1,523
    Likes Received:
    1,772
    But it's useful if it's not about bulk work but work generation, which is what my request is about.
    Leaving aside the context of RT or other massively parallel tasks, let's look at how we handle program flow at the coarse level instead. We have two options:
    1. Execute shaders, download results to the CPU, make a decision about the necessary future work, dispatch new shaders to execute that work. (This is very slow.)
    2. Prebuild a command buffer that contains every potential branch of control flow - all indirect dispatches and barriers that could happen - upload it only once, and the GPU can do its thing on its own. (This is what a low level API gives me, and I get a speedup of two over the previous approach, which is a huge win.)

    The problem with option 2 is that a fixed program flow might be too limited for some cases.
    For example, I could not execute a solver in a loop that repeats until some error criterion is met. (Mantle could do this.)
    But for me the main problem is that the vast majority of indirect dispatches end up doing nothing. Their dispatch parameters end up zero, but the barriers are still executed and take time. (Profiled 2 years ago on AMD using Vulkan.)
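    The overhead pattern described here (empty indirect dispatches whose barriers still cost time) can be sketched with a tiny illustrative cost model in Python. The unit costs below are entirely made up; only the structure matters: replaying a prebuilt command buffer pays one barrier per *recorded* dispatch, regardless of whether that dispatch's group count ends up zero.

```python
BARRIER_COST = 5   # made-up units per recorded pipeline barrier
GROUP_COST   = 1   # made-up units per actually dispatched workgroup

def replay_cost(group_counts):
    """Replaying a prebuilt indirect command buffer: every recorded
    barrier executes, even when its dispatch parameter is zero."""
    return sum(BARRIER_COST + GROUP_COST * n for n in group_counts)

# 100 potential branches recorded up front, only 5 ever receive work:
recorded = [0] * 95 + [64] * 5
print(replay_cost(recorded))   # 820: 500 units of barriers, 320 of work
print(replay_cost([64] * 5))   # 345: what GPU-side work generation could cost
```

    The gap between the two numbers is pure barrier overhead on dispatches that did nothing, which is the inefficiency the post is complaining about.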

    Also - and ironically - looking at option 2 we see the same limitation you discuss here within a shader running on parallel cores: we execute branches of code for nothing.
    How can it be that we see this same limitation at the outer, coarsest level of handling overall program flow? Obviously there is something wrong.
    And this is why it makes me mad to see the introduction of HW RT, mesh shaders and other awesomeness, while such obvious shortcomings go unaddressed for many years.

    IIRC, OpenCL 2.0 has some form of kernels that use only one thread. Maybe the purpose is work generation - I do not remember. But it's clear that the inefficiency of a single-thread program would not matter if it helps solve the mentioned problems. We would execute only a tiny number of such work-generating shaders.

    ...just to put my rant into context: I am not requesting callable shaders and then using them for everything. A 'work shader' that can generate a command buffer directly on the GPU would already do.
     
  17. Lurkmass

    Regular

    Joined:
    Mar 3, 2020
    Messages:
    565
    Likes Received:
    711
    I'm not sure we're talking about the same thing, but attempting global synchronization across different warps/waves is undefined behaviour in any shading language. It could either cause race conditions or, in the case of a mutex, a deadlock.

    Sure, keeping a single lane active is one solution, or you could just separate the thread groups into different dispatches.
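    The "separate dispatches" workaround can be illustrated with a toy Python model (all names here - `dispatch`, `producer`, `consumer` - are invented for the sketch, not a real API). Within one dispatch, thread groups may run in any order, so a consumer group must never wait on a producer group; putting the consumer in a second dispatch gives an implicit barrier between the two, even under an adversarial execution order.

```python
def dispatch(kernel, num_groups, state):
    """Run every thread group of one dispatch. The order is unspecified,
    so we simulate an adversarial (reversed) schedule on purpose."""
    for g in reversed(range(num_groups)):
        kernel(g, state)

def producer(g, state):
    state["data"][g] = g * g                  # each group fills its own slot

def consumer(g, state):
    # Safe: the producer dispatch fully retired before this one started.
    state["out"][g] = state["data"][g] + 1

state = {"data": [None] * 4, "out": [None] * 4}
dispatch(producer, 4, state)  # implicit barrier between the two dispatches
dispatch(consumer, 4, state)
print(state["out"])           # [1, 2, 5, 10]
```

    If producer and consumer groups were interleaved inside a single dispatch, the consumer could observe unwritten slots; the split makes the ordering explicit without any intra-dispatch synchronization.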

    As for the second half of your post, I'm pretty certain we're not going to see any progress, so let's end it there.
     
  18. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    17,879
    Likes Received:
    5,330
    Looking at some slides for DX12 Ultimate, one of the bullet points was:
    * Hardware supporting DX12 Ultimate must support ALL features.

    Now, back in the DX11 days with its feature level 1, feature level 2, etc., I complained that I should just be able to buy a DX11 card and know it will support all DX11 features, only to be told that this was bad and that having cards that don't support certain features was a good thing. So the question is: do B3D readers disagree with Microsoft?
     
    milk likes this.
  19. Alessio1989

    Regular

    Joined:
    Jun 6, 2015
    Messages:
    614
    Likes Received:
    321
    A DX12 Ultimate logo/card means a GPU that will support the upcoming D3D_FEATURE_LEVEL_12_2, which will include the new geometry pipeline, DXR tier 1.1 and sampler feedback. This is simple. There are also secondary features with single cap-bits, only meant for UMA/mobile architectures or multi-GPU support. Pretending that every card will support every single cap-bit is nonsense. Feature levels are meant to gather the cap-bits for the most significant features according to existing and upcoming hardware. You can still query for single cap-bits if you know that some architectures not supporting a certain feature level still support some of its features. That's the same as DirectX 10.1/11.x, and still far better than Direct3D 9 and previous versions, where everything was a cap-bits hell, and still better than OpenGL (which at least has core profiles since 3.0) and Vulkan with their extension hell.
     
    pharma, Jay and DavidGraham like this.
  20. Jay

    Jay
    Veteran

    Joined:
    Aug 3, 2013
    Messages:
    4,032
    Likes Received:
    3,428
    What it does do is give a nice baseline feature set.
    They could've gone with DX12.5 or 13, but I think this is probably better marketing for what is the start of a new gen for consoles.

    It would be nice to extend DX12U to other parts of the system. For example: to be DX12U compliant you need an SSD (DX Velocity), a minimum 6c/12t CPU, etc.

    Then, if a studio wants to, they can target it knowing they don't need to support non-RT cards, HDDs, etc.
    Even years from now that will leave a lot of users out, but a studio may determine that the savings in production of the game are worth it.

    Let's be clear: DX12U means >= RDNA2 for AMD. That's a lot of users who won't be supported, even 2 years from now. Factor in laptops also.
     
    pharma, milk and BRiT like this.

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.