I misremembered the diagram then, with in-flight pixels needing to reach export before preemption can occur.
http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10
For compute, they preempt at the instruction or thread level. For graphics, it's at the pixel level.
> My position is that there are trade-offs and that the developer must decide. That doesn't leave much room for the HWS to infer what the tradeoffs are.

There will always be trade-offs. The QoS measures should be easy to disable if there is no need for them. Reserve CUs for guaranteed latency, but risk them sitting idle. The same goes for reserving waves, except you only lower occupancy as opposed to idling hardware. It's ultimately up to the developer to decide which is the proper route. Options never hurt.
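For illustration, AMD's ROCm stack does expose this sort of CU reservation through a HIP extension. A minimal sketch, assuming hipExtStreamCreateWithCUMask is available on the target ROCm version; the 64-CU device and the specific masks are assumptions for illustration, not a recommendation:

```cpp
// Sketch: carving out CUs for a latency-sensitive stream via HIP's
// CU-mask extension (AMD ROCm). Assumes a 64-CU GPU; adjust masks to fit.
#include <hip/hip_runtime.h>
#include <cstdint>

int main() {
    // Each bit in the mask enables one CU for work launched on that stream.
    // Reserve the low 8 CUs for a "real-time" stream...
    uint32_t rtMask[2]   = {0x000000FFu, 0x00000000u};
    // ...and give a bulk/throughput stream the remaining 56 CUs.
    uint32_t bulkMask[2] = {0xFFFFFF00u, 0xFFFFFFFFu};

    hipStream_t rtStream, bulkStream;
    hipExtStreamCreateWithCUMask(&rtStream, 2, rtMask);
    hipExtStreamCreateWithCUMask(&bulkStream, 2, bulkMask);

    // Work on rtStream now contends only within its 8 CUs: predictable
    // latency, but those CUs idle whenever rtStream is empty -- exactly
    // the reservation trade-off described above.
    // hipLaunchKernelGGL(latencyKernel, grid, block, 0, rtStream /*...*/);

    hipStreamDestroy(rtStream);
    hipStreamDestroy(bulkStream);
    return 0;
}
```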
> Full preemption of the graphics context is very costly, and it's generally the best choice when there aren't alternatives.

Yes, and full preemption would always be undesirable. That said, there are cases where it's the best choice. It could be compute preemption as well.
Compute is able to preempt at a far smaller granularity, which is why discussing this often requires stating what is being preempted, and at what granularity. There are vastly different things operating under the label of preemption.
> That's called QoS, and predictable, with overall performance being poorer, is often what real-time entails.

I'm not saying CU reservation isn't a good option for some cases. It definitely has its uses, but it will likely come at the cost of overall throughput. Throwing the entire GPU at a problem with prioritization and some contention would likely outperform a 20% reservation with a guarantee. Predictable execution that takes longer isn't ideal. Regardless, all of these options exist so a developer can decide what is best.
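As a sketch of the prioritization alternative, here is the stream-priority route, using the priority API that HIP mirrors from CUDA. Priorities hint the scheduler rather than reserving hardware; the stream names are illustrative:

```cpp
// Sketch: whole-GPU sharing with prioritized streams instead of a hard CU
// reservation. Both streams can use every CU; the high-priority queue is
// favored when they contend, but nothing is guaranteed.
#include <hip/hip_runtime.h>

int main() {
    int leastPrio = 0, greatestPrio = 0;
    // Numerically lower values mean higher priority, as in CUDA.
    hipDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    hipStream_t latencyStream, throughputStream;
    hipStreamCreateWithPriority(&latencyStream, hipStreamDefault, greatestPrio);
    hipStreamCreateWithPriority(&throughputStream, hipStreamDefault, leastPrio);

    // The latency-sensitive kernel contends for the whole GPU but jumps the
    // queue: no CUs sit idle, but its latency is no longer guaranteed --
    // the throughput-versus-predictability trade-off discussed above.
    // hipLaunchKernelGGL(latencyKernel, grid, block, 0, latencyStream /*...*/);
    // hipLaunchKernelGGL(bulkKernel, grid, block, 0, throughputStream /*...*/);

    hipStreamDestroy(latencyStream);
    hipStreamDestroy(throughputStream);
    return 0;
}
```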
It is not about outperforming a solution that is highly unresponsive; it is about making the hardware acceptable for task types that will not tolerate typical GPU latencies, or keeping them safe from having the system reset the device due to timeouts.
> Gating off whole SIMDs is not new, while going down to the lane level has not come up for architectures in general discussion.

We haven't seen it, but gating off execution units in general isn't exactly new.
Doing per-lane voltage adjustment on top of that also hasn't come up for current architectures.