"It's occurred to me that NVAPI stuff has been wired into UE4 for the Fable Legends test, which is why some passes are dramatically faster on NVidia."

None of those extensions are required on DX12, and thus by Fable.
"Looks like Nvidia's driver is really a picky eater."

Not really - the vast majority of that stuff is good advice on everyone's driver. Compare to AMD's and Intel's recommendations at GDC and you'll see that it's mostly common. Example (shameless self-promotion):
https://software.intel.com/sites/de...ndering-with-DirectX-12-on-Intel-Graphics.pdf
http://intelstudios.edgesuite.net/idf/2015/sf/aep/GVCS004/GVCS004.html
The divergence on NVIDIA vs. AMD/Intel is more around stuff like them packing depth+stencil together (and apparently not being overly confident in the performance of their ROV stuff). Overall these are fairly minor details that can easily be handled by engines if they want optimal performance everywhere.
"Personally, I'm more worried about this one:
- Don't create too many threads or too many command lists
- Too many threads will oversubscribe your CPU resources, whilst too many command lists may accumulate too much overhead"

There's nothing to worry about there either - same advice for everyone (see my GDC slides above). This goes along with the regular advice of scaling out to the size of the machine, but no further. Parallelism has a cost and there's no point in paying it once you've filled the machine. This is true on both the CPU and GPU - go as wide as you have to with your algorithm and no wider; run the serial algorithm in the leaves (see the sketch below).
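To make "scale to the machine, no further" concrete, here's a minimal C++ sketch. The hardware_concurrency-based sizing is the point; RecordFrame/RecordCommands and the per-worker task slicing are hypothetical names for illustration, not anything from UE4 or Fable.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical stand-in for recording one task into a worker's own
// command list; not from any real engine.
void RecordCommands(unsigned /*taskIndex*/) {}

// Size the worker pool to the machine, not to the number of tasks:
// "go as wide as you have to ... and no wider".
void RecordFrame(unsigned taskCount)
{
    const unsigned hw      = std::max(1u, std::thread::hardware_concurrency());
    const unsigned workers = std::min(hw, std::max(1u, taskCount));

    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (unsigned i = 0; i < workers; ++i)
    {
        pool.emplace_back([=]
        {
            // Each worker runs the serial algorithm over its slice of
            // tasks ("run the serial algorithm in the leaves"), so the
            // number of threads and command lists stays bounded by the
            // machine rather than by the task count.
            for (unsigned t = i; t < taskCount; t += workers)
                RecordCommands(t);
        });
    }
    for (auto& th : pool)
        th.join();
}
```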
Right, you do not require separate queues to avoid stalls at resource barriers - this is precisely what the "split barrier" stuff in DX12 is for. As long as you ensure enough overlap/time between the "begin" and "end" of your split barrier, the GPU may not incur any real cost at all, as many barrier operations can be farmed out to simultaneous hardware engines (blit, etc.). So, if you can, you should avoid an algorithm that requires task-level barriers.
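For reference, this is roughly what the split-barrier pattern looks like at the API level; a minimal sketch, assuming a shadow map transitioning from depth write to shader read (the resource and states are illustrative, the flags and calls are standard D3D12):

```cpp
#include <d3d12.h>

// Split-barrier sketch: begin the transition early, end it late, and let
// the GPU overlap the barrier work with whatever is recorded in between.
void TransitionShadowMapSplit(ID3D12GraphicsCommandList* cmdList,
                              ID3D12Resource* shadowMap)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type                   = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = shadowMap; // hypothetical resource
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_DEPTH_WRITE;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

    // Begin the transition early; the GPU is free to overlap the work
    // (decompression, cache flushes, etc.) with subsequent commands.
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
    cmdList->ResourceBarrier(1, &barrier);

    // ... record unrelated work here that doesn't touch the shadow map ...

    // End the transition just before the resource is actually sampled.
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
    cmdList->ResourceBarrier(1, &barrier);
}
```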
Alternatively, you can create multiple parallel queues, each of which has its own sequence of tasks separated by barriers. But now you need an algorithm that can be chopped into pieces small enough to be spread over multiple queues.
If you can chop up an algorithm like this, then you can also stripe the tasks in a single queue: A1, B1, C1, A2, B2, C2.
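And that striping is trivial to express; a toy sketch (Task, StripeSingleQueue, and the A/B/C chains are all hypothetical illustrations, not an API):

```cpp
#include <initializer_list>
#include <vector>

// Hypothetical task descriptor: which chain (A/B/C) and which step.
struct Task { char chain; int step; };

// Stripe three independent task chains into one queue's submission order:
// A1, B1, C1, A2, B2, C2, ... The barrier between A1 and A2 can hide
// behind B1 and C1 even though everything sits in a single queue.
std::vector<Task> StripeSingleQueue(int steps)
{
    std::vector<Task> order;
    for (int s = 1; s <= steps; ++s)
        for (char chain : {'A', 'B', 'C'})
            order.push_back({chain, s});
    return order;
}
```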