DX12 Performance Discussion And Analysis Thread

I don't know if this can be answered, but is it AMD or Nvidia that provides a chart of state change costs, or is it the console makers?
At the very least, they might pay enough money to the GPU makers to tabulate the results for the specific models in the platform.
There's much less incentive to track the larger number of revisions and SKUs in the PC space, particularly when APIs were at their thickest.
 
It would be great if AMD released PC profiling tools that show a per-CU timeline of wave occupancy (how many waves of each shader are running on each CU at every time slice) plus markers for various stalls. This would give the programmer enough information to properly optimize their DX12 code for concurrent execution (barriers & async compute). Current PC tools are not good enough for this.
Agreed on wanting more info all around, although it's worth noting that for future perf portability you don't want to go as low as you do on the consoles, even if you can. This extends to games ported from console, although those will tend to undershoot PC GPU perf anyways, so it's not as bad.

If you get too timing and resource/overlap specific in your optimization you can actually hurt performance on potential future architectures, similar to stuff like software prefetch, which almost always ends up being a de-optimization in the long run.
 
I don't know if this can be answered, but is it AMD or Nvidia that provides a chart of state change costs, or is it the console makers?
At the very least, they might pay enough money to the GPU makers to tabulate the results for the specific models in the platform.
There's much less incentive to track the larger number of revisions and SKUs in the PC space, particularly when APIs were at their thickest.
All documentation is provided by the console makers, though the data may come indirectly from the hardware vendor.
 
This might go slightly off the topic, but are there any OSDs out there that actually work with D3D12? MSI Afterburner's and Steam's at least don't seem to work in Caffeine (the first D3D12-supporting title released so far, not counting beta releases and whatnot).
Or any other generic tools that could at least log FPS/frametimes?
 
This might go slightly off the topic, but are there any OSDs out there that actually work with D3D12? MSI Afterburner's and Steam's at least don't seem to work in Caffeine (the first D3D12-supporting title released so far, not counting beta releases and whatnot).
Or any other generic tools that could at least log FPS/frametimes?
Nothing that I'm aware of yet.
 
Agreed on wanting more info all around, although it's worth noting that for future perf portability you don't want to go as low as you do on the consoles, even if you can. This extends to games ported from console, although those will tend to undershoot PC GPU perf anyways, so it's not as bad.

If you get too timing and resource/overlap specific in your optimization you can actually hurt performance on potential future architectures, similar to stuff like software prefetch, which almost always ends up being a de-optimization in the long run.
You would want to confirm that each CU's occupancy stays high during the whole frame. If some of your draw calls / kernels do not fully occupy the whole GPU, you need to figure out ways to increase the amount of parallel work (by reorganizing your frame and your barriers). The same is true for choke points (where the whole GPU practically stalls for a short while waiting for something to finish, and it takes quite a while to ramp back up to full occupancy, as ramping up causes so many cache misses). You want to be doing something else using async compute when your main queue stalls.
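As a rough illustration of the async compute point, here is a minimal D3D12 sketch with placeholder names (the fence, queues and pre-recorded command lists are assumed to exist; this is not code from the thread). The independent work goes to a compute queue, and the graphics queue only waits on the fence at the point where it consumes the results, so the compute dispatches can fill the gap while the graphics queue would otherwise stall:

```cpp
#include <d3d12.h>

// Sketch only: fence tracking, command recording, PSOs and bindings are assumed
// to be handled elsewhere. computeQueue was created with
// D3D12_COMMAND_LIST_TYPE_COMPUTE.
void SubmitWithAsyncCompute(ID3D12CommandQueue* graphicsQueue,
                            ID3D12CommandQueue* computeQueue,
                            ID3D12Fence* fence, UINT64& fenceValue,
                            ID3D12CommandList* const* computeLists, UINT numCompute,
                            ID3D12CommandList* const* consumerLists, UINT numConsumer)
{
    // Kick the independent compute work and mark its completion with a fence value.
    computeQueue->ExecuteCommandLists(numCompute, computeLists);
    computeQueue->Signal(fence, ++fenceValue);

    // The graphics queue only waits where it actually consumes the compute
    // results; everything submitted to it before this point is free to overlap.
    graphicsQueue->Wait(fence, fenceValue);
    graphicsQueue->ExecuteCommandLists(numConsumer, consumerLists);
}
```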

Good tools help in detecting problems like this. In general, optimizations that give the GPU more available parallelism also help scaling with newer GPUs (as GPUs tend to get wider all the time). Too much overlap might in some cases make the memory access patterns worse (increasing cache thrashing), but this is almost never a problem in properly optimized shaders, especially when you use threadgroup shared memory.

I fully agree about software prefetch. Our old Xbox 360 (PPC) code was filled with software prefetch instructions. Modern x86 CPUs do not need them to perform well. Software prefetch requires hand-optimized timing to get any gains, otherwise it just thrashes your caches. Hand-optimized timing would be impossible on PC CPUs, and OoO complicates the timings further.

Cuckoo hash lookup is an interesting special case for software prefetch. You might need to perform two random accesses, and both of them might cache miss. If you prefetch the second and then fetch the first, you are guaranteed to never wait for more than a single cache miss (as the memory system can fetch both simultaneously). Obviously, if you don't need the second (the key was found in the first), you lose some bandwidth, but you don't need to wait for that data either. So this is a bandwidth-for-latency trade-off. Personally I don't like hashing, as it randomizes the memory accesses (practically guaranteeing cache misses). Linear accesses (or almost linear ones) are much preferable.
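A quick sketch of that cuckoo-lookup prefetch pattern (a hypothetical two-table layout; the Entry struct and the hash functions are invented for illustration, not code from the post):

```cpp
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // _mm_prefetch

struct Entry { uint64_t key; uint64_t value; };

// Placeholder hash functions for the sketch.
static inline size_t hash1(uint64_t k) { return static_cast<size_t>(k * 0x9E3779B97F4A7C15ull); }
static inline size_t hash2(uint64_t k) { return static_cast<size_t>((k ^ (k >> 32)) * 0xC2B2AE3D27D4EB4Full); }

bool cuckoo_find(const Entry* table1, const Entry* table2, size_t mask,
                 uint64_t key, uint64_t& out)
{
    size_t i1 = hash1(key) & mask;
    size_t i2 = hash2(key) & mask;

    // Start pulling the second candidate into cache before reading the first,
    // so the two potential misses overlap instead of serializing.
    _mm_prefetch(reinterpret_cast<const char*>(&table2[i2]), _MM_HINT_T0);

    if (table1[i1].key == key) { out = table1[i1].value; return true; }
    if (table2[i2].key == key) { out = table2[i2].value; return true; }
    return false;
}
```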
 
Perhaps PC games should micro-benchmark the chip they're running on to enable some decisions about certain trade-offs, or make such choices explicit advanced graphics options for the gamer.
 
Perhaps PC games should micro-benchmark the chip they're running on to enable some decisions about certain trade-offs, or make such choices explicit advanced graphics options for the gamer.
Sounds appealing at first, but that's not such a good idea. Synthetic benchmarks are only really going to work for known hardware. Future generations may screw up your benchmark unintentionally, so you would really need to profile your full application to make any sound assumptions about the platform.

Besides, games are already doing that, partially. Mostly when they have an "autodetect" button for graphics settings, which rarely works as expected. More often than not it's either aiming way too high or way too low.

If anything, you would want to profile while the player is gaming, as this would also enable you to tune for extreme in-game situations.

But ultimately, you don't want to probe for hardware capabilities at all. You want to build an engine layout which scales well regardless of which platform it runs on, and which is flexible enough to adapt to a wide range of hardware configurations without any explicit knowledge of them.

And still, in order to design such a system properly, you need to know the edge cases. You must know what worst case scenarios you need to avoid at all costs or at least where you need to stay flexible enough to give the driver/hardware that option.

If you get too timing and resource/overlap specific in your optimization you can actually hurt performance on potential future architectures, similar to stuff like software prefetch which almost always ends up being de-optimization in the long run.
You don't want to optimize for it in the sense of enforcing it. But you don't want to prevent these "optimizations" from occurring naturally either. You want to know how flexible you need to remain to cover all the extremes found in current and upcoming hardware generations.

Prefetch is actually a good example. Explicit, manual prefetch is only going to work well for a specific architecture. But restructuring your code to hide memory latencies, by reducing the number of instructions your memory accesses depend on and by issuing memory accesses early and in batches, is going to benefit almost every platform, regardless of OoO, speculative execution and whatever.
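A toy illustration of that restructuring, comparing a dependent pointer-chase style loop with a pre-gathered index list (the data layout is invented for the example, not from the post):

```cpp
#include <cstddef>

// Dependent version: each load's address depends on the previous load's value,
// so the cache misses serialize.
int sum_chained(const int* next, const int* values, int start, int count)
{
    int sum = 0, node = start;
    for (int i = 0; i < count; ++i) {
        sum += values[node];
        node = next[node];           // the next address depends on this load
    }
    return sum;
}

// Restructured version: the indices are known up front, so the loads are
// independent and the memory system can keep several misses in flight at once,
// on basically any architecture.
int sum_batched(const int* indices, const int* values, int count)
{
    int sum = 0;
    for (int i = 0; i < count; ++i)
        sum += values[indices[i]];   // independent accesses, easy to overlap
    return sum;
}
```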

Because you were aware upfront that a certain workload could have caused a disadvantage, and you knew how to mitigate the effects by avoiding the possible worst case scenarios. And you gave the compiler a chance to do its job properly, even if it decided to restructure your code again and not to use the safety net.
 
If anything, you would want to profile while the player is gaming, as this would also enable you to tune for extreme in-game situations.
Modern architectures do provide more means of gathering data at runtime, although one of the extreme in-game situations introduced by game-time profiling is going to be "the profiling is excessively impacting performance".
In the CPU realm, Intel and AMD have various tools.
One example is AMD's Lightweight Profiling (LWP) instructions, although all signs point to them being discarded in future cores. It seems like some developers did like them.

But ultimately, you don't want to probe for hardware capabilities at all. You want to build an engine layout which scales well regardless of which platform it runs on, and which is flexible enough to adapt to a wide range of hardware configurations without any explicit knowledge of them.
This is a nice theoretical ideal, but the potential glass jaws can have implementations dropping to fallbacks, emulation, bus transfers, or (heaven forbid) IO when they simply don't crash. This means differing platforms, or even more advanced versions of the same platform, do not provide a smooth function for determining how an engine should scale. Given the complex problem space, the desire to create an engine that can optimize for any contingency runs into the desire to ship something, and it assumes that being able to handle anything doesn't yield the result of making everything suboptimal.
The latter case has been a perennial problem ever since computer scientists started working on dynamically optimizing software runtimes, which is not a new concept and which has some imperfect exemplars in those thick API black-box drivers whose deprecation is being celebrated.

Being generally scalable in such a situation is still possible, as CPU vendors try to do when they tout super-high scaling numbers in the absence of absolute measurements, since it's often exponentially easier to get fantastic scaling figures once you set the baseline low enough. When facing a need to scale on platforms where one or another can fall down by an order of magnitude, a good baseline for scaling may not be the same as a good baseline for acceptability.

And still, in order to design such a system properly, you need to know the edge cases. You must know what worst case scenarios you need to avoid at all costs or at least where you need to stay flexible enough to give the driver/hardware that option.
This unfortunately runs against the desire to not probe for capabilities, since those are a major source of edge cases, and knowing what edge cases there may be in a dynamic context and with evolving platforms is a tall order.

Prefetch is actually a good example. Explicit, manual prefetch is only going to work well for a specific architecture. But restructuring your code to hide memory latencies, by reducing the number of instructions your memory accesses depend on and by issuing memory accesses early and in batches, is going to benefit almost every platform, regardless of OoO, speculative execution and whatever.
Aggressively optimizing memory accesses statically has a cost, particularly once you exhaust the number of safe accesses that can be determined at coding or compile time.
That's not to say that such optimizations are not great things to do, just that in many realms designers have moved past the point where the comparatively low-hanging fruit has been plucked. Although I may be speaking out of turn, since there seem to be a lot of things that GPU and game development treat as new that are not.

On top of that, the more heroic the reorganization, the larger the compiler's optimization space and the human programmer's headspace need to be, oftentimes for an optimization that might not break even with its own overhead.
It would be nice if there were a sufficiently light and expressive way to inform the code of what performance state the system is in on a dynamic basis, and that the software could readily mold to fit. This is a very non-trivial problem, unfortunately, and progress at a hardware and platform level is uneven.

You don't want to optimize for it in the sense of enforcing it. But you don't want to prevent these "optimizations" from occurring naturally either. You want to know how flexible you need to remain to cover all the extremes found in current and upcoming hardware generations.
This is traveling into the territory of asking the developer and the software to know the unknowable. How is a developer to know what decisions might break an optimization that does not yet exist?
 
You don't want to optimize for it in the sense of enforcing it. But you don't want to prevent these "optimizations" from occurring naturally either. You want to know how flexible you need to remain to cover all the extremes found in current and upcoming hardware generations.
Absolutely, but that goes back to my previous point - for high-level scheduling stuff, going "as parallel as possible" is actually sub-optimal, as there is overhead to parallelism. This is the same reason you don't do a pairwise reduction on a narrow machine; you go only as wide as you need to fill the machine, and then run the most efficient algorithm on each "core", which is usually the serial algorithm.

There's wiggle room on both sides of the "optimal" point of course, but my point is that doing something like putting as much stuff as possible onto separate async queues is not generally a good idea. For instance, why not split every CS dispatch up into its individual groups and put them on separate queues? Obviously a contrived example, since the API already defines these to be "independent", but I hope the point is clear: there is a cost to parallelism, and how much you should throw at a machine depends on the width of the machine. If you get too far off in either direction you will run sub-optimally.
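A minimal CPU-side sketch of the "only as wide as the machine" reduction idea (thread count and chunking are illustrative, not from the post): split the data into one chunk per hardware thread, run the efficient serial algorithm within each chunk, and combine the handful of partials at the end instead of doing a fully pairwise tree.

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Go only as wide as the machine: one chunk per hardware thread, then the
// plain serial algorithm (std::accumulate) within each chunk.
double parallel_sum(const std::vector<double>& data)
{
    unsigned width = std::max(1u, std::thread::hardware_concurrency());
    size_t chunk = (data.size() + width - 1) / width;

    std::vector<double> partial(width, 0.0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < width; ++t) {
        size_t begin = std::min(data.size(), t * chunk);
        size_t end   = std::min(data.size(), begin + chunk);
        workers.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();

    // Combining a handful of partials is cheap; going wider than the machine
    // would only add scheduling and synchronization overhead.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```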

Because you were aware upfront that a certain workload could have caused a disadvantage, and you knew how to mitigate the effects by avoiding the possible worst case scenarios. And you gave the compiler a chance to do its job properly, even if it decided to restructure your code again and not to use the safety net.
Right so the only thing I'd caveat in the case of DX12 here is that it's not a nice situation where you - for instance - give the driver a DAG and let it sort out the most efficient way to map it to hardware. It's much more like SMT where you're paying a cost to spawn something fairly heavy-weight (a thread) and all the associated synchronization that you hope to recover via parallelism. On a machine that doesn't have SMT you shouldn't launch that additional thread as it will be pure overhead, and there will likely be similar best practices for various architectures in this area depending on how efficiently each can keep itself fed from various mechanisms.
 
Absolutely, but that goes back to my previous point - for high-level scheduling stuff, going "as parallel as possible" is actually sub-optimal, as there is overhead to parallelism. This is the same reason you don't do a pairwise reduction on a narrow machine; you go only as wide as you need to fill the machine, and then run the most efficient algorithm on each "core", which is usually the serial algorithm.
Of course. Using an inferior algorithm that requires more data transfer (usually more thread blocks = more data transfer) will be slower on narrow machines. I do not advocate these kinds of optimizations (unless the trade-off is minimal, or you have absolutely nothing else to do at that point of the frame). Fortunately DX12 also gives you more control over running multiple narrow kernels simultaneously. Put them into the same queue, and do not limit the parallelism between them with barriers. Narrow GPUs will execute the kernels one at a time (starting the next one when the previous starts to fade away) and wider GPUs will execute them all concurrently.
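A minimal D3D12 sketch of that pattern (root signature and descriptor bindings are omitted, and the PSOs/resource names are placeholders, so this only illustrates the barrier placement, not code from the post): independent dispatches are recorded back to back with no barrier between them, and a single UAV barrier goes where a later pass actually consumes their output.

```cpp
#include <d3d12.h>

// Record three independent compute kernels with no barriers between them, then
// one UAV barrier before the pass that reads what they wrote. A narrow GPU runs
// them roughly in order; a wide GPU is free to overlap them.
void RecordIndependentDispatches(ID3D12GraphicsCommandList* cl,
                                 ID3D12PipelineState* psoA,
                                 ID3D12PipelineState* psoB,
                                 ID3D12PipelineState* psoC,
                                 ID3D12Resource* sharedOutput)
{
    cl->SetPipelineState(psoA);
    cl->Dispatch(64, 1, 1);   // narrow kernel A
    cl->SetPipelineState(psoB);
    cl->Dispatch(64, 1, 1);   // narrow kernel B, independent of A
    cl->SetPipelineState(psoC);
    cl->Dispatch(64, 1, 1);   // narrow kernel C, independent of A and B

    // One barrier, only where a later pass consumes the results of A/B/C.
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
    barrier.UAV.pResource = sharedOutput;
    cl->ResourceBarrier(1, &barrier);
}
```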
Right so the only thing I'd caveat in the case of DX12 here is that it's not a nice situation where you - for instance - give the driver a DAG and let it sort out the most efficient way to map it to hardware.
You can write the DAG yourself :). Unfortunately on PC we still don't know the exact hardware details about the scheduling on all GPUs, so it is hard to make this perfect on PC.
 
Folks, just to let you know, Intel drivers 20.19.15.4300 are available on the Windows Update Catalog. I don't have the release notes, but they work fine on my SP3 (Haswell HD 4400), and it looks like they finally handle plugging and unplugging of AC power correctly when different refresh rates are set on Windows 10 (i.e. the screen now correctly switches between the different refresh rates).
 

Well the Alpha only supports AFR for now, so a weak iGPU may not make that much of a difference if you have a medium-range desktop card or better.

Regardless, I guess from now on, no one will ever replace graphics cards, they'll just add new ones. The market for huge motherboards with lots of PCIe slots will blow up as well as the market for >1200W PSUs.
 
Well, I don't think I want to play with an iGPU tagging along in AFR mode... How many milliseconds does it take to render that 3840x2160 2x MSAA frame on, say, a Haswell GPU? A 980 Ti takes 30 ms...
 
I still don't believe in DX12 multi-adapter. It's going to be buggy, oftentimes it will cause a slowdown rather than a speedup, and the performance gains it provides are minimal. It's the classic example of technology for its own sake. Double the complexity of the rendering engine, ask game developers to optimize for a combinatorial space of GPU combinations, and then (from MS slides) you go from 35 to 39 FPS!!

What a disaster.

Game developers can't make simple games stable these days. This just makes it all worse. And for what?
 
Perhaps you should read the Anandtech article before reaching that conclusion...
 
I must admit I'm not terribly excited by the potential of this particular use for multi-adapter. I'd much rather see this used with iGPUs to give PCs that much-vaunted (but rarely witnessed, as far as I can see) HSA capability that consoles have for low-latency communication between CPU and GPU, while still allowing all the benefits of a massive discrete GPU for the heavy-duty rendering work that doesn't require low-latency comms with the CPU. That's the best of all worlds IMO.
 
I still don't believe in DX12 multi-adapter. It's going to be buggy, oftentimes it will cause a slowdown rather than a speedup, and the performance gains it provides are minimal. It's the classic example of technology for its own sake. Double the complexity of the rendering engine, ask game developers to optimize for a combinatorial space of GPU combinations, and then (from MS slides) you go from 35 to 39 FPS!!

What a disaster.

Game developers can't make simple games stable these days. This just makes it all worse. And for what?
The DirectX 11 automatic AFR only worked well in simple cases. Developers could not implement modern optimizations in their engines (such as re-using / reprojecting last frame's data or doing partial data updates). Automatic AFR required per-title driver hacks to work optimally, meaning that only AAA games from the biggest studios worked well. Sometimes the driver release lagged behind the game launch, meaning that AFR wasn't working well when the game shipped.

Automatic AFR was a disaster. Now the developer at least has a chance to fix the problems themselves.
 