Vulkan/OpenGL Next Generation Initiative: unified API for mobile and non-mobile devices.

Seems like this news puts Direct3D in a really tough spot. Personally, I'd be happy if Mac supported Vulkan and we could just have every "PC" game show up on Windows, Mac, and Linux.
Vulkan is Windows + Linux + Android. Consoles have their own APIs. iOS and Mac have Metal.
 
I am not fully convinced: to me it would make more sense to have a single universal high-level SPIR-V interface (an intermediate API, maybe with a new rebranded shading language) that can be used by both OpenCL and Vulkan.
Merging all the rest of the APIs looks like a big no-no to me. Remember also that OpenCL is not meant to run only on GPUs.
Of course I may be wrong; are there any more detailed papers or slides about this?
edit: is SYCL really the answer? Adding abstraction and overhead on top of a low-overhead API?
 
I am not fully convinced: to me it would make more sense to have a single universal high-level SPIR-V interface (an intermediate API, maybe with a new rebranded shading language) that can be used by both OpenCL and Vulkan.
Merging all the rest of the APIs looks like a big no-no to me. Remember also that OpenCL is not meant to run only on GPUs.
Agreed. Also, OpenCL and Vulkan have different memory models. I just hope we get a better C++-style shading language for Vulkan, with generics and other modern features, like Metal and CUDA have. HLSL and GLSL were designed for DX9-era hardware, to describe small (<64 instruction) pixel/vertex shader programs. We need a better language for compute shaders.
 
I'll disagree here. HLSL & GLSL are C derivatives; although they lack pointers (which reduces their versatility), they by no means prevent good programming practice.
You have always been able to forward-declare structs & functions, and you can compose them to generate your programs without relying on that horrible preprocessor (which really should be forbidden); it encourages good design. (Well, it would without the horror that a string preprocessor is.)
But I agree that a C with generics (and pointers) would be much more powerful and close to my ideal language (be it for GPU or CPU).

I'm interested in the scan and reduce functions, the pipelines, and the ability for the GPU to drive itself to some extent.
I can't wait to have the CPU & GPU living their own lives with some async communication channel (conceptually MPI-like).
 
I'll disagree here. HLSL & GLSL are C derivatives; although they lack pointers (which reduces their versatility), they by no means prevent good programming practice.
You have always been able to forward-declare structs & functions, and you can compose them to generate your programs without relying on that horrible preprocessor (which really should be forbidden); it encourages good design. (Well, it would without the horror that a string preprocessor is.)
But I agree that a C with generics (and pointers) would be much more powerful and close to my ideal language (be it for GPU or CPU).
There are lots of stupid limitations. Let me list some...

- I can't pass references to groupshared memory arrays as function parameters. The only way to access a groupshared memory array is by hardcoding its name (the global declaration). Thus it becomes impossible to write generic reduction functions (scan/sum/avg/etc.), even for a hardcoded type.
- I can't reuse the same groupshared memory region for multiple differently typed arrays. If I want to first process int4 data in groupshared memory and then float3 data in groupshared memory, I need to declare both arrays separately -> a much bigger groupshared memory allocation.
- There are no generic types. For example, if I want to write a sorting function, I need to copy & paste the whole implementation for each different type.
- There are no lambdas/closures. For example, I have a very well-optimized (complex) ray tracer loop, but I want to customize the inner-loop sampling function. A call to a lambda/closure parameter would be a perfect solution, but HLSL doesn't support it. The lambda would obviously be inlined, so the resulting shader would be identical to the copy & paste version.

I don't need classes or inheritance or silly things like that. C with generics + lambdas is fine for me. I only need features that make it possible to write reusable basic function libraries for compute shaders. Currently I have lots of macro hacks and copy & paste, and I want to get rid of this bullshit.
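
What I'm asking for is nothing exotic. As a rough illustration (plain C++ standing in for a hypothetical C-with-generics-and-lambdas shading language; nothing here is real HLSL), a single generic reduction covers sum/min/max/avg for every type, with the customizable inner function passed as an inlinable lambda:

```cpp
#include <cstddef>

// Generic reduction: works for any element type T and any combine operation.
// In current HLSL this has to be copy-pasted (or macro-generated) once per
// type/operation combination, because there are no templates and no lambdas.
template <typename T, typename Op>
T reduce(const T* values, std::size_t count, T identity, Op combine)
{
    T result = identity;
    for (std::size_t i = 0; i < count; ++i)
        result = combine(result, values[i]);
    return result;
}

// Usage: the caller customizes the inner function with a lambda that the
// compiler inlines, so the generated code matches the hand-written version.
// float sum = reduce(samples, n, 0.0f,   [](float a, float b) { return a + b; });
// float mx  = reduce(samples, n, -1e30f, [](float a, float b) { return a > b ? a : b; });
```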
 
Interesting comparison between AMD and NVIDIA in Vulkan on Linux: the site used a GTX 1060 vs an RX 580 with 3 different CPUs, ranging from a Celeron G and a Pentium G to an i7-7700K. They tested 5 Vulkan games (Dota 2, The Talos Principle, Mad Max, Serious Sam 3, and Dawn of War 3) and tested the OpenGL path as well.

https://www.phoronix.com/scan.php?page=article&item=kblcpus-gl-vlk&num=2

Generally speaking NVIDIA was much faster in both the OpenGL path and Vulkan; its lead in Vulkan ranged anywhere from 20% to 57% depending on the title. Some of the Vulkan games were 10-15% slower than OpenGL on both AMD and NVIDIA, though; others were faster with Vulkan.

NVIDIA's OpenGL and Vulkan CPU overhead was smaller, and the GTX 1060 was able to extract more performance out of the Celeron CPU in both Vulkan and OpenGL.
 
I just ran the 3DMark API Overhead feature test (v2.0) on my i7-3770K (4 GHz all cores) + GTX 970 (OCed a bit). I'm still on Win7 so no DX12, but I found the Vulkan result interesting.

DirectX 11 single-thread: 2,464,400 draw calls per second
DirectX 11 multi-thread: 3,057,329 draw calls per second
Vulkan: 17,128,458 draw calls per second

I've always heard that NVIDIA cards (especially pre-Pascal) don't gain much from DX12/Vulkan, but I guess it does make a bit of a difference. 3 million vs 17 million, lmao.
 
I just ran the 3DMark API Overhead feature test (v2.0) on my i7-3770K (4 GHz all cores) + GTX 970 (OCed a bit). I'm still on Win7 so no DX12, but I found the Vulkan result interesting.

DirectX 11 single-thread: 2,464,400 draw calls per second
DirectX 11 multi-thread: 3,057,329 draw calls per second
Vulkan: 17,128,458 draw calls per second

I've always heard that NVIDIA cards (especially pre-Pascal) don't gain much from DX12/Vulkan, but I guess it does make a bit of a difference. 3 million vs 17 million, lmao.

That has nothing to do with the hardware; that's the CPU overhead of the runtime.
 
There are two reasons why DX12/Vulkan have been "disappointing" ("don't gain much") thus far:

1) While you can create micro-benchmarks showcasing the cost of "high" API overhead, in real life this essentially never happens. In fact, the vast majority of the time you are bottlenecked by the GPU, and in the cases where you are bottlenecked by the CPU, it's never because of API overhead (it's usually physics, game logic, etc.). So while the API overhead reductions that DX12/Vulkan have made are very real, we just haven't found a practical use case that really utilizes that potential. I do blame MS/IHVs/etc. for making this seem like a big issue that DX12/Vulkan could solve. It's not even a small issue at this point in time, imo.

2) Most DX12/Vulkan paths in engines at the moment are just "ports" of their DX11 path, meaning they essentially try to "emulate" what existing DX11 drivers do (this is an oversimplification, but for the moment trust my hand-waving). Well, guess what? NVIDIA/AMD/Intel are better at writing "drivers" than game developers. They've been in the "driver business" FAR longer, have MANY more resources to throw at the problem, and they know their hardware better than game developers. ;-)

Many developers will admit in private that without async compute, their DX12/Vulkan paths would always be slower than DX11 (i.e. in completely GPU-bound situations, their DX11 path would be faster than their DX12 path). This is where your "(especially pre-Pascal)" comment comes into play. Async compute can only help if you have idle units to spare. If Kepler was already near "full capacity" with DX11, async compute is not going to help much; in fact, if your "async compute implementation" has a big enough overhead, it can actually hurt. For async compute to pay off, you need to utilize enough idle units to overcome the overhead that async compute can introduce, and in practice this didn't seem to happen very often for Kepler.

AMD has gotten a lot of credit for their async compute support (and to be clear, they've done a great job with it on both the hardware and the software side), but really what we are saying is that their DX11 driver (for whatever reason) left a lot of units idle compared to the competition, so they had the most to gain. The takeaway here is that if your current DX11 path is already utilizing the GPU well, your (naive) DX12 path will be slower even with async compute.

The "overall takeaway" is while Vulkan and DX12 have a lot of potential, at the moment we (as a community) are not in the position to make use of them (at least in a revolutionary way). The reality is developers still need to support DX11 and at the moment it's difficult to formulate an abstraction layer that will support both "DX12 style rendering" and DX11 style rendering" efficiently. It'll be a bit before we can fire on all cylinders, but we'll get there! :D
 
That has nothing to do with the hardware; that's the CPU overhead of the runtime.
But the GPU + driver is not irrelevant. It's more of a software issue, but the hardware does matter to some extent (for example, the DX12/Vulkan way of issuing commands would not work on G80).

On another note, DX11 games are rarely limited by draw calls since there are many ways to reduce them, but at the end of the day they still take a significant amount of CPU time. In DX12/Vulkan, those tricks can still be used, and the amount of CPU time spent issuing commands to the GPU should be minimal, right?
 
On another note, DX11 games are rarely limited by draw calls since there are many ways to reduce them, but at the end of the day they still take a significant amount of CPU time. In DX12/Vulkan, those tricks can still be used, and the amount of CPU time spent issuing commands to the GPU should be minimal, right?
Well... you need to know the circumstances under which the numbers above from the 3DMark API Overhead feature test are achieved. It's pretty much just draw calls without much state changing in between. At the end of the day, even in DX11 most of the CPU cost comes from building and changing graphics state, not directly from issuing draw calls, and that's not something DX12/Vulkan made significantly faster.
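
To illustrate, here's roughly what the inner loop of such a draw-call micro-benchmark boils down to on the Vulkan side (a sketch in C++; the pipeline, layout, and DrawItem types are placeholders). Nearly all state was baked into the pipeline object earlier, so the per-draw cost is a couple of cheap function calls; the expensive work of compiling shaders and building state happens at pipeline creation, which the test does once up front:

```cpp
#include <vulkan/vulkan.h>
#include <vector>

struct DrawItem { uint32_t vertexCount; uint32_t firstVertex; };

// Record a batch of draws with a single pre-built pipeline and descriptor set.
// This is the part the API Overhead test hammers: no state is rebuilt per draw.
void recordDraws(VkCommandBuffer cmd, VkPipeline pipeline, VkPipelineLayout layout,
                 VkDescriptorSet descriptorSet, const std::vector<DrawItem>& draws)
{
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline);
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                            0, 1, &descriptorSet, 0, nullptr);
    for (const DrawItem& d : draws)
        vkCmdDraw(cmd, d.vertexCount, 1, d.firstVertex, 0);
    // The costly part (vkCreateGraphicsPipelines) already ran, long before this loop.
}
```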
 
Well... you need to know the circumstances under which the numbers above from the 3DMark API Overhead feature test are achieved. It's pretty much just draw calls without much state changing in between. At the end of the day, even in DX11 most of the CPU cost comes from building and changing graphics state, not directly from issuing draw calls, and that's not something DX12/Vulkan made significantly faster.
Pardon my ignorance, but is that true for GCN cards as well?
 
Of course. I'm not sure why it would be different? You still need to compile DXIL shaders into something that's actually executable by the hardware, for example. Constructing PSOs is one of the most expensive operations in D3D12, so much so that there is now an ID3D12PipelineLibrary object which allows developers to efficiently cache them and store them to disk.
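
For reference, a minimal sketch of that caching pattern (C++/D3D12; the helper name and error handling are simplified). The library is deserialized from a blob loaded off disk at startup, so on later runs the expensive shader-to-ISA compilation is skipped entirely:

```cpp
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Fetch a PSO from an ID3D12PipelineLibrary if a previous run already compiled it,
// otherwise do the full (slow) compilation and store the result for next time.
ComPtr<ID3D12PipelineState> GetOrCreatePso(ID3D12Device1* device,
                                           ID3D12PipelineLibrary* library,
                                           const wchar_t* name,
                                           const D3D12_GRAPHICS_PIPELINE_STATE_DESC& desc)
{
    ComPtr<ID3D12PipelineState> pso;

    // Fast path: the driver-specific compiled pipeline is already in the library.
    if (SUCCEEDED(library->LoadGraphicsPipeline(name, &desc, IID_PPV_ARGS(&pso))))
        return pso;

    // Slow path: compile from the shaders in 'desc', then cache it for next run.
    if (SUCCEEDED(device->CreateGraphicsPipelineState(&desc, IID_PPV_ARGS(&pso))))
        library->StorePipeline(name, pso.Get());
    return pso;
}
```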
 
But the GPU + driver is not irrelevant.

For this kind of test it pretty much is [irrelevant] for DX12/Vulkan. Everything used for drawing is a monolithic, precompiled object that already lives on the GPU, so even if the DX12 runtime had to check dirty state (which it doesn't), it would have to check about two to three orders of magnitude less CPU-side state than DX11. Also, the [optimal] paradigm for managing dynamic data is so vastly different that you probably get a factor of two or three speedup just in passing vertex/index data to the GPU on every call - it depends on what they did (which can be verified with RenderDoc, BTW), e.g. MAP_DISCARD.
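
To spell out the dynamic-data point, a side-by-side sketch (C++; buffer handles, sizes, and synchronization are placeholders) of the DX11 MAP_DISCARD pattern versus the persistently-mapped ring buffer you would typically use with DX12/Vulkan:

```cpp
#include <d3d11.h>
#include <vulkan/vulkan.h>
#include <cstring>

// DX11-era pattern: orphan the dynamic buffer on every update with MAP_DISCARD and
// let the driver juggle renamed copies behind the scenes.
void UploadD3D11(ID3D11DeviceContext* ctx, ID3D11Buffer* vb,
                 const void* vertices, size_t bytes)
{
    D3D11_MAPPED_SUBRESOURCE mapped{};
    if (SUCCEEDED(ctx->Map(vb, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped))) {
        std::memcpy(mapped.pData, vertices, bytes);
        ctx->Unmap(vb, 0);
    }
}

// DX12/Vulkan-style alternative: map a HOST_VISIBLE ring buffer once at startup,
// leave it mapped, and have each frame memcpy into its own slice. A per-frame
// fence (not shown) prevents overwriting data the GPU is still reading.
void* MapOnce(VkDevice device, VkDeviceMemory uploadMemory, VkDeviceSize totalBytes)
{
    void* ptr = nullptr;
    vkMapMemory(device, uploadMemory, 0, totalBytes, 0, &ptr);
    return ptr;  // stays mapped for the lifetime of the allocation
}

void UploadVulkan(void* persistentlyMapped, uint32_t frameIndex,
                  size_t frameBytes, const void* vertices, size_t bytes)
{
    std::memcpy(static_cast<char*>(persistentlyMapped) + frameIndex * frameBytes,
                vertices, bytes);
}
```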

It's more of a software issue, but the hardware does matter to some extent (for example, the DX12/Vulkan way of issuing commands would not work on G80).

Of course it would [be faster] - it would either approach the command-processor bottleneck, possibly the vertex-fetch limit, or the ROP fillrate limit, instead of the DX11 run-time limit.
You might think the speedup wouldn't be much, but that's because you are probably imagining a G80 driven by a 4 GHz Skylake. You have to think of an appropriate CPU of that era, and then (a kind of inflation adjustment) the relative factor between DX12-style and DX11-style submission would likely be about the same as today with a Pascal or Vega driven by that Skylake. Remember you measured a factor of ~6; that'd be about the delta between the CPU-side draw cost and the GPU-side draw cost. Or maybe not - you still don't know whether you hit a GPU limit, or whether it's the i7-3770K not being able to scale further up. You could try to find 970s with even higher rates in a public database.

On another note, DX11 games are rarely limited by draw calls since there are many ways to reduce them, but at the end of the day they still take a significant amount of CPU time.

Most DX11 games and engines are purely CPU-side draw-call limited; that's why they get away with so much brute-force post-processing instead, which uses only a few draw calls to saturate the GPU. You're mistaking cause and effect: you don't see many calls in these games because calls are slow in the first place, and the games are distorted to hit a given speed before release, assets and source code alike. This makes porting to new paradigms very hard, because you have to scrap 50% of the low-level render-loop workaround hacks and replace them with decent strategies at the high level.

In DX12/Vulkan, those tricks can still be used, and the amount of CPU time spent issuing commands to the GPU should be minimal, right?

You can use those tricks, but it's pointless to render a 60 FPS game at 120 FPS; you'd rather run at 60 FPS with twice the content.
 