AMD: Speculation, Rumors, and Discussion (Archive)

It's possibly not just the shaders in the preemption case, since the compute submission isn't allowed to ramp until the graphics portion goes to zero.
I suspect that the preemption referred to in that article is what Windows (10) does. I suspect this is a brute force kind of preemption which is asking the driver to "wipe the slate clean". So while a compute task is shown taking over in this case, it could, equally, be another graphics task. That could explain the "compute submission isn't allowed to ramp until the graphics portion goes to zero" aspect of this. Which would look to the driver (and perhaps the hardware too?) like what silent_guy is alluding to: a mass of in-GPU state has to be protected from the new task, which has free rein.

The command processor has run ahead an unknown number of commands in the queues; in-flight state changes, their ordering, barriers, and pending messages to and from the rest of the system all need to be resolved.
Graphics state rollovers have a similar drain, though not every state change forces one. I wonder how that interacts with a graphics preemption, or whether the state changes that are significant enough to do so are the same ones that restrict concurrency among graphics wavefronts in the first place.
I think there are multiple definitions of "preempt" that are probably in contention here.

I'm curious if the quick response queue can really guarantee that kind of resource ramp, or the floor value for graphics utilization. Does having that mode on mean the driver or GPU is quietly reserving wavefront slots away from one function or the other to make sure they are available?
It's just prioritisation of work in the queue, surely. Only asynchronous shaders can go in this queue and it's just turning up their priority to 11.
 
It's just prioritisation of work in the queue, surely. Only asynchronous shaders can go in this queue and it's just turning up their priority to 11.
Just reread the paper, guess you are right, fine grained preemption is limited to swapping out compute kernels only. No idea where I got the preemption of draw calls from.
Great, that means Pascal still can't even remotely guarantee any latencies for async compute as long as there is activity on the 3D queue, which means async timewarp still conflicts with long running draw calls.
 
I suspect that the preemption referred to in that article is what Windows (10) does. I suspect this is a brute force kind of preemption which is asking the driver to "wipe the slate clean". So while a compute task is shown taking over in this case, it could, equally, be another graphics task.
That's slightly less brute-force than what the OS can resort to in the absence of graphics preemption, which is to wipe out the application that is violating the device's response time thresholds--or crash everything, sometimes.
The failure to respond to that kind of preemption request reminds me of the problems some posters had with the benchmark in the DX12 thread. Unless preemption support is in place, I would worry about any solution that relied on that method.

edit: On second thought, I think I am mixing the various forms of preemption and might need to rule out the OS being key to this. Possibly, it's an example of how bad preemption can be, but the OS is not going to care about preempting something at the frame times in question. Given its concern about system QoS, the OS isn't going to concern itself with an intra-frame problem within an application's own display logic.
The latency for any kind of queuing at the OS level could exceed what frame time there is. This could be a command in the code, or an automatic trigger the GPU has to some kind of state change.

I think there are multiple definitions of "preempt" that are probably in contention here.
There would seemingly need to be, although they aren't isolated from one another.
Being able to freely preempt a Graphics context without resetting the device does imply the ability to preempt stubborn graphics wavefronts that are refusing to complete on time.
However, just being able to preempt at the finer level leaves open questions about the global context and ordering that a graphics wavefront and any number of concurrent wavefronts (depending on some varying amount of buffered states of indeterminate velocity through the overall pipeline) depend on.

It's just prioritisation of work in the queue, surely. Only asynchronous shaders can go in this queue and it's just turning up their priority to 11.
This goes to one of the definitions of preempt at a graphics wavefront level. It's a marketing picture, so it is free to be idealized and optimistic, but turning up wavefront priority pre-launch doesn't help in the unlucky case that the existing graphics wavefronts take an uncomfortably long time to complete.
 
Just reread the paper, guess you are right, fine grained preemption is limited to swapping out compute kernels only.
While that might be true, I can't see how that can be concluded.

Great, that means Pascal still can't even remotely guarantee any latencies for async compute as long as there is activity on the 3D queue, which means async timewarp still conflicts with long running draw calls.
I don't see how you can conclude that. NVidia has written about preemption. An omission on the subject of hardware support for simultaneous execution of tasks can't be taken as evidence it's not there. The paper is written for people who aren't writing rendering code.

Also, AMD isn't guaranteeing latencies. It's merely allowing prioritisation, with extremely coarse granularity (off or on!). It could be argued that preemption on AMD is very slow (drain-out of the task and state export? - then reversed), so when it comes to making a choice about the technique for minimal-latency, just-in-time kernel execution it's more effective to use the quick response queue. Other hardware (e.g. Pascal) might be much faster at preemption and it might turn out to be preferable to use that style of task scheduling.

The graphs in AMD's article "indicate" that the preemption case results in the slowest of all total latency. But at the same time, the compute task finishes earliest in this case. Earliest possible finish, at the cost of longest overall duration, probably isn't the choice most developers would make, whatever the application. Note though that this choice does add latency to both tasks when compared with the pure "async" case.

Preemption can in theory result in overlapping execution of tasks, if the two tasks don't share state storage that can only hold the state of any single task at one time. Which is why I think graphics and compute should be able to overlap during state-drain and state-fill. But the picture for this in AMD's article indicates that doesn't happen.
 
That's slightly less brute-force than what the OS can resort to in the absence of graphics preemption, which is to wipe out the application that is violating the device's response time thresholds--or crash everything, sometimes.
The failure to respond to that kind of preemption request reminds me of the problems some posters had with the benchmark in the DX12 thread. Unless preemption support is in place, I would worry about any solution that relied on that method.
Years ago there was no timeout detection and recovery (TDR) in Windows - or, if you prefer, it could be turned off. Yes, I have written infinite loops on the GPU and had to hit the power switch :p

Many times.

I have code that, given a sufficiently slow GPU and rich enough parameters, will enqueue kernels that run for much more than a second. It makes browsing or gaming quite jerky. Even when kernel run times are in milliseconds there can still be jerkiness if gaming at the same time...

edit: On second thought, I think I am mixing the various forms of preemption and might need to rule out the OS being key to this. Possibly, it's an example of how bad preemption can be, but the OS is not going to care about preempting something at the frame times in question. Given its concern about system QoS, the OS isn't going to concern itself with an intra-frame problem within an application's own display logic.
The latency for any kind of queuing at the OS level could exceed what frame time there is. This could be a command in the code, or an automatic trigger the GPU has to some kind of state change.
While that's something I take as read in this discussion, I think it's also fair to say it's not that hard to get a game (old GPU plus max resolution plus max settings) to run at less than 1fps and then you wonder how long before TDR...
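For what it's worth, the timeout itself is governed by the TdrDelay registry value (the documented OS default is 2 seconds when it isn't set). A minimal C++ sketch that only reads it back -- reading is harmless, changing it needs admin rights and a reboot and is a different conversation:

```cpp
// Reads the TDR delay (in seconds) that Windows will tolerate before it
// resets the GPU. If the value is absent, the documented default of 2
// seconds applies. This sketch only reads the key; it never writes it.
#include <windows.h>
#include <cstdio>
#pragma comment(lib, "advapi32.lib")

int main()
{
    DWORD tdrDelay = 0;
    DWORD size = sizeof(tdrDelay);
    LSTATUS status = RegGetValueW(
        HKEY_LOCAL_MACHINE,
        L"SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
        L"TdrDelay",
        RRF_RT_REG_DWORD,
        nullptr,
        &tdrDelay,
        &size);

    if (status == ERROR_SUCCESS)
        printf("TdrDelay override: %lu second(s)\n", tdrDelay);
    else
        printf("No TdrDelay override set; the default of 2 seconds applies.\n");
    return 0;
}
```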

There would seemingly need to be, although they aren't isolated from one another.
Being able to freely preempt a Graphics context without resetting the device does imply the ability to preempt stubborn graphics wavefronts that are refusing to complete on time.
However, just being able to preempt at the finer level leaves open questions about the global context and ordering that a graphics wavefront and any number of concurrent wavefronts (depending on some varying amount of buffered states of indeterminate velocity through the overall pipeline) depend on.


This goes to one of the definitions of preempt at a graphics wavefront level. It's a marketing picture, so it is free to be idealized and optimistic, but turning up wavefront priority pre-launch doesn't help in the unlucky case that the existing graphics wavefronts take an uncomfortably long time to complete.
The quick response queue isn't preempting though. It's just prioritisation. It's no different than if you tell your operating system to run an ordinary application at maximum priority. If that application happens to soak your CPU cores and one or more disks to near 100% utilisation, everything else you try to do on the PC will get miserably slower.

It's up to the graphics programmer to decide whether quick response queue is appropriate. If graphics wavefronts are taking uncomfortably long then it's their own fault.
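For reference, the closest thing an application can ask for through the API is command-queue priority. A minimal D3D12 sketch is below -- whether a particular driver maps a high-priority compute queue onto AMD's quick response queue is an assumption on my part, and the priority is a scheduling hint, not a latency guarantee:

```cpp
// Create a normal-priority graphics queue plus a high-priority compute queue.
// The priority value only biases the scheduler; it does not preempt anything.
#include <wrl/client.h>
#include <d3d12.h>
#pragma comment(lib, "d3d12.lib")

using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<ID3D12Device> device;
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0,
                                 IID_PPV_ARGS(&device))))
        return 1;

    // Ordinary-priority direct (graphics) queue for the frame's draw work.
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type     = D3D12_COMMAND_LIST_TYPE_DIRECT;
    gfxDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
    ComPtr<ID3D12CommandQueue> gfxQueue;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    // High-priority compute queue for latency-sensitive async work
    // (e.g. a timewarp-style kernel).
    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    computeDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

    return 0;
}
```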
 
This goes to one of the definitions of preempt at a graphics wavefront level. It's a marketing picture, so it is free to be idealized and optimistic, but turning up wavefront priority pre-launch doesn't help in the unlucky case that the existing graphics wavefronts take an uncomfortably long time to complete.
Speculating on behavior here, but to achieve a benefit from concurrent execution it would make sense for CUs/SIMDs to be split among a number of different kernels, balancing the load based on the perceived bottleneck. Some portion of the processing resources may well be reserved. CUs could be running at the lowest possible occupancy that maintains full utilization. In the case of preemption, this could allow waves to be scheduled immediately. It would also decrease the likelihood of long-running compute and graphics tasks running simultaneously. If the driver were regularly detecting preemption, it could start reserving space as well. That might not work for the first frame, but it should be manageable. At the very least the pieces keep moving.

Also, AMD isn't guaranteeing latencies. It's merely allowing prioritisation, with extremely coarse granularity (off or on!). It could be argued that preemption on AMD is very slow (drain-out of the task and state export? - then reversed), so when it comes to making a choice about the technique for minimal-latency, just-in-time kernel execution it's more effective to use the quick response queue. Other hardware (e.g. Pascal) might be much faster at preemption and it might turn out to be preferable to use that style of task scheduling.
Should be 4 priority levels for compute kernels as I recall from their ISA Whitepaper. How they're exposed to APIs is another matter.
 
Just reread the paper, guess you are right, fine grained preemption is limited to swapping out compute kernels only. No idea where I got the preemption of draw calls from.
Great, that means Pascal still can't even remotely guarantee any latencies for async compute as long as there is activity on the 3D queue, which means async timewarp still conflicts with long running draw calls.

Worth remembering that the paper is specifically looking at Pascal through the eyes of a Tesla implementation.
I think some did manage to discuss the pre-emption/draw-call aspect with NVIDIA staff after the conference where the P100 was presented.
While still vague, I remember one generalised quote being that from a graphics perspective the latency improvement will be around 5ms - the same figure Sony mentioned for their own async compute benefit during PS4 development, which was, ironically, also 5ms back then.
Cheers
Orb
 
I meant to put "latency" in quotes because it is not clear whether they meant 5ms per frame (which ties into the PS4 development presentation point) or really latency, but for some reason they missed out discussing the scheduling implementation change that has happened with Pascal, which would also be relevant. *shrug*.
Anyway, just to emphasise that the paper will be looking at this from the perspective of Tesla, which is an accelerator/co-processor, and so would not cover the questions being raised, including how the new scheduler is implemented.
Cheers
 
While that's something I take as read in this discussion, I think it's also fair to say it's not that hard to get a game (old GPU plus max resolution plus max settings) to run at less than 1fps and then you wonder how long before TDR...
At least some of the promise with the idea of graphics preemption is that TDR is less likely to be tripped in that case, since the GPU can respond more gracefully when the OS asks for it.

At this point, the concept of preemption seems to be three or so layers deep.
The outermost layer being OS QoS and device preemption, which can be tripped if the GPU's graphics pipeline does not respond, whose response can be held up by a wavefront.
Then there's the GPU's ability to preempt for the purposes of AMD's diagram, which can be hobbled by a lack of responsiveness at the wavefront level.
Then there's the wavefront instruction's level of preemption, which might be something that can happen more often with future hardware.

The quick response queue isn't preempting though. It's just prioritisation.
That's my general assumption of how optimistic the marketing diagram is, given how hardware is currently rather inflexible in that regard.

It's up to the graphics programmer to decide whether quick response queue is appropriate. If graphics wavefronts are taking uncomfortably long then it's their own fault.
The curve would be closer to the ideal in the marketing, if the GPU could more assertively reapportion its resources. Sony's VR at least externally can take things to 120 FPS, and if future VR hardware and software moves to use that rate natively, having smaller error bars on the GPU's response time could make the tool even more of a value-add.

Speculating on behavior here, but to achieve a benefit from concurrent execution it would make sense for CUs/SIMDs to be split among a number of different kernels, balancing the load based on the perceived bottleneck. Some portion of the processing resources may well be reserved. CUs could be running at the lowest possible occupancy that maintains full utilization.
Automatically determining that sweet spot could be difficult. The inverse, where a developer purposefully creates long-lived compute tasks and possibly uses synchronization with atomics rather than queue-level barriers, could engineer a low-latency response.

If synchronization were tighter, the preemption case could actually change the behavior of the code versus the asynchronous or prioritized cases. That might be something that could be tested more readily with a synthetic test, where the output of the compute step might change if it is able to reference data affected by the graphics load and how much progress it makes before the GPU preempts.
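A crude CPU-side sketch of that detection pattern (an analogy I'm making up, not GPU code): one task publishes progress markers, the other samples them; if the sampler ever sees a partial value the two overlapped, and if it only ever sees the final value the first effectively ran to completion before the second got a look in. A GPU version would do the same with atomics in the compute kernel reading markers written by the graphics workload.

```cpp
// "Graphics" task A publishes a progress counter; "compute" task B samples it.
// Observing an intermediate value proves the two tasks overlapped; never
// observing one is consistent with serialized / preempt-to-completion behavior.
#include <atomic>
#include <thread>
#include <cstdio>

int main()
{
    constexpr int kIterations = 1000000;
    std::atomic<int>  progress{0};       // written by "graphics" task A
    std::atomic<bool> sawPartial{false}; // set by "compute" task B

    std::thread taskA([&] {
        for (int i = 1; i <= kIterations; ++i)
            progress.store(i, std::memory_order_relaxed);
    });

    std::thread taskB([&] {
        int p = 0;
        do {
            p = progress.load(std::memory_order_relaxed);
            if (p > 0 && p < kIterations)
                sawPartial = true;       // caught A mid-flight: overlap observed
        } while (p < kIterations);
    });

    taskA.join();
    taskB.join();

    printf(sawPartial ? "Tasks observably overlapped.\n"
                      : "No overlap observed; A appeared to run to completion first.\n");
    return 0;
}
```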
 
At this point, the concept of preemption seems to be three or so layers deep.
The outermost layer being OS QoS and device preemption, which can be tripped if the GPU's graphics pipeline does not respond, whose response can be held up by a wavefront.
Then there's the GPU's ability to preempt for the purposes of AMD's diagram, which can be hobbled by a lack of responsiveness at the wavefront level.
Then there's the wavefront instruction's level of preemption, which might be something that can happen more often with future hardware.
Yes that seems to be the kind of thing that's happening.

I wonder, though, whether D3D can only offer type 2 preemption by performing a QoS style preemption request. How long the driver/hardware takes to respond, and whether the nature of the response is something that the game programmer wants to use, are different questions.

I think it's worth remembering that games usually employ hundreds to thousands of draw calls per frame, so the latency for worst-case preemption should be tolerable. It's just clunky.
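As a rough ballpark (my own arithmetic, not from any article): at 60 fps with ~1000 draws you get 16.7 ms / 1000 ≈ 17 µs per draw on average, so draw-call-boundary preemption usually kicks in quickly; the real pain is the occasional heavyweight draw that takes a large slice of the frame on its own.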
 
If synchronization were tighter, the preemption case could actually change the behavior of the code versus the asynchronous or prioritized cases. That might be something that could be tested more readily with a synthetic test, where the output of the compute step might change if it is able to reference data affected by the graphics load and how much progress it makes before the GPU preempts.
AFAIK this has been done before. It in fact yielded race conditions on GCN, and none on Maxwell. But you can't claim preemption from that; even pure parallel execution is sufficient to trigger such side effects. And as preemption isn't necessarily atomic, it's even harder to distinguish.

I wonder, though, whether D3D can only offer type 2 preemption by performing a QoS style preemption request. How long the driver/hardware takes to respond, and whether the nature of the response is something that the game programmer wants to use, are different questions.
Looks as if MS allows up to 4ms(!) for preemption, and that's already under the premise of a "reasonable workload": https://msdn.microsoft.com/en-us/library/windows/hardware/jj123501(v=vs.85).aspx / Device.Graphics.WDDM12.Render.PreemptionGranularity / D3DKMDT_GRAPHICS_PREEMPTION_GRANULARITY

I'm clueless as how to actually make use of these flags though. I can't even find any code samples via Google querying these caps.

But that's a huge gap between the specifications and the industry requirements....
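On actually querying those caps from an application: the same granularity values surface through DXGI 1.2 in DXGI_ADAPTER_DESC2, so at least reading them back is straightforward. A minimal sketch:

```cpp
// Query the graphics/compute preemption granularity the driver reports for
// the default adapter. These mirror the WDDM 1.2 preemption caps that the
// MSDN page above describes on the kernel side.
#include <wrl/client.h>
#include <dxgi1_2.h>
#include <cstdio>
#pragma comment(lib, "dxgi.lib")

using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<IDXGIFactory1> factory;
    if (FAILED(CreateDXGIFactory1(IID_PPV_ARGS(&factory))))
        return 1;

    ComPtr<IDXGIAdapter1> adapter1;
    if (FAILED(factory->EnumAdapters1(0, &adapter1)))
        return 1;

    ComPtr<IDXGIAdapter2> adapter2;
    if (FAILED(adapter1.As(&adapter2)))
        return 1;                       // pre-WDDM 1.2 driver: no preemption caps

    DXGI_ADAPTER_DESC2 desc = {};
    adapter2->GetDesc2(&desc);

    // 0 = DMA buffer, 1 = primitive, 2 = triangle, 3 = pixel, 4 = instruction
    printf("Graphics preemption granularity: %d\n", (int)desc.GraphicsPreemptionGranularity);
    // 0 = DMA buffer, 1 = dispatch, 2 = thread group, 3 = thread, 4 = instruction
    printf("Compute preemption granularity:  %d\n", (int)desc.ComputePreemptionGranularity);
    return 0;
}
```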

PS:
Not sure if it has been mentioned before, but AMD has added a Quick Response Queue a couple of weeks ago for GCN GEN2 and onwards via driver update, which is confirmed to trigger preemption to some extent.
 
[...]AMD has added a Quick Response Queue a couple of weeks ago for GCN GEN2 and onwards via driver update, which is confirmed to trigger preemption to some extent.
* Wavefront preemption, I meant to say. Hence also the requirement for GEN2 and above.
 
Just reread the paper, guess you are right, fine grained preemption is limited to swapping out compute kernels only. No idea where I got the preemption of draw calls from.
Great, that means Pascal still can't even remotely guarantee any latencies for async compute as long as there is activity on the 3D queue, which means async timewarp still conflicts with long running draw calls.


There was mention of preemption while draw calls are being done. It wasn't in the white paper though....

They specifically mentioned that preemption can now happen at any time, unlike before, when it was limited to draw call boundaries.

It was on nV's website too.
 
AMD Radeon R9 M480 has Polaris GPU

The R9 M480 GPU has a Device ID of 67E0, which, as you might remember from this story, is actually Baffin aka Polaris 11. The AMD Radeon M480 is equipped with 4GB of GDDR5 memory clocked at 1250 MHz, and the GPU clock is set to 1000 MHz, although it may vary depending on the supplier.

It appears that both the R9 M480 and R9 M490 series will feature new Polaris chips. Despite the rather unconfirmed nature of the Polaris specifications, we do know that full Polaris 10 is equipped with at least 2304 Stream Processors, and Polaris 11 features at minimum 1280 unified cores.

Therefore the AMD Radeon M480 may feature cut-down silicon, whereas the Radeon M480X should receive the full-fat Polaris 11 GPU. Although AMD may decide to skip the R9 4xxX and R9 485/495 series to simplify their offering.
http://videocardz.com/59465/amd-radeon-r9-m480-based-on-polaris-11-gpu
 
PS:
Not sure if it has been mentioned before, but AMD has added a Quick Response Queue a couple of weeks ago for GCN GEN2 and onwards via driver update, which is confirmed to trigger preemption to some extent.
Any idea if the priority level affects scheduling on the SIMD? I always assumed it favored one kernel over another for dispatching wavefronts. No reason it couldn't pass down to SIMD scheduling though.
 
http://videocardz.com/59487/amd-polaris-11-and-10-gpus-pictured

Polaris 11 die picture

Polaris 10 die picture

Polaris 10 die comparison to Tonga
 