AMD: Speculation, Rumors, and Discussion (Archive)

Discussion in 'Architecture and Products' started by iMacmatician, Mar 30, 2015.

Thread Status:
Not open for further replies.
  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I suspect that the preemption referred to in that article is what Windows (10) does, and that it's a brute-force kind of preemption which asks the driver to "wipe the slate clean". So while a compute task is shown taking over in this case, it could equally be another graphics task. That could explain the "compute submission isn't allowed to ramp until the graphics portion goes to zero" aspect of this. Which would look to the driver (and perhaps the hardware too?) like what silent_guy is alluding to: a mass of in-GPU state has to be protected from the new task, which has free rein.

    I think there are multiple definitions of "preempt" that are probably in contention here.

    It's just prioritisation of work in the queue, surely. Only asynchronous shaders can go in this queue and it's just turning up their priority to 11.
     
  2. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    Just reread the paper; I guess you are right, fine-grained preemption is limited to swapping out compute kernels only. No idea where I got the preemption of draw calls from.
    Great, that means Pascal still can't even remotely guarantee any latencies for async compute as long as there is activity on the 3D queue, which means async timewarp still conflicts with long-running draw calls.
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    That's slightly less brute-force than what the OS can resort to in the absence of graphics preemption, which is to wipe out the application that is violating the device's response-time thresholds, or sometimes crash everything.
    The failure to respond to that kind of preemption request reminds me of the problems some posters had with the benchmark in the DX12 thread. Unless preemption support is in place, I would worry about any solution that relied on that method.

    edit: On second thought, I think I am mixing the various forms of preemption and might need to rule out the OS being key to this. Possibly, it's an example of how bad preemption can be, but the OS is not going to care about preempting something at the frame times in question. Given its concern about system QoS, the OS isn't going to concern itself with an intra-frame problem within an application's own display logic.
    The latency for any kind of queuing at the OS level could exceed what frame time there is. This could be a command in the code, or an automatic trigger the GPU has to some kind of state change.

    There would seemingly need to be, although they aren't isolated from one another.
    Being able to freely preempt a graphics context without resetting the device does imply the ability to preempt stubborn graphics wavefronts that are refusing to complete on time.
    However, just being able to preempt at that finer level leaves open questions about the global context and ordering that a graphics wavefront and any number of concurrent wavefronts (depending on some varying amount of buffered state moving at an indeterminate rate through the overall pipeline) depend on.

    This goes to one of the definitions of preempt at a graphics wavefront level. It's a marketing picture, so it is free to be idealized and optimistic, but turning up wavefront priority pre-launch doesn't help in the unlucky case that the existing graphics wavefronts take an uncomfortably long time to complete.
     
    #1383 3dilettante, Apr 29, 2016
    Last edited: Apr 30, 2016
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    While that might be true, I can't see how that can be concluded.

    I don't see how you can conclude that. NVidia has written about preemption. An omission on the subject of hardware support for simultaneous execution of tasks can't be taken as evidence it's not there. The paper is written for people who aren't writing rendering code.

    Also, AMD isn't guaranteeing latencies. It's merely allowing prioritisation, with extremely coarse granularity (off or on!). It could be argued that preemption on AMD is very slow (drain-out of the task and state export, then the reverse to resume?), so when it comes to choosing a technique for minimal-latency, just-in-time kernel execution, it's more effective to use the quick response queue. Other hardware (e.g. Pascal) might be much faster at preemption, and it might turn out to be preferable to use that style of task scheduling there.

    The graphs in AMD's article "indicate" that the preemption case results in the longest total latency of all the cases. But at the same time, the compute task finishes earliest in this case. Earliest possible finish, at the cost of longest overall duration, probably isn't the choice most developers would make, whatever the application. Note though that this choice does add latency to both tasks when compared with the pure "async" case.

    Preemption can in theory result in overlapping execution of tasks, if the two tasks don't share state storage that can only hold the state of any single task at one time. Which is why I think graphics and compute should be able to overlap during state-drain and state-fill. But the picture for this in AMD's article indicates that doesn't happen.
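    To make that trade-off concrete, here's a toy calculation with entirely made-up numbers (none of them from AMD's article) showing why preempting for the compute task can give the earliest compute finish yet the longest overall duration: the graphics work is paused, so its own finish slips by roughly the compute task's length plus the drain/restore overhead.

    Code:
    // Toy timeline arithmetic with hypothetical durations (not AMD's figures).
    // Assumes the compute request arrives right as the graphics work starts.
    #include <cstdio>

    int main() {
        const double gfx_ms  = 10.0;  // hypothetical graphics workload
        const double cs_ms   = 2.0;   // hypothetical compute task
        const double swap_ms = 0.5;   // hypothetical drain/restore cost per switch

        // Idealised async case: compute overlaps graphics with no contention.
        std::printf("async:   compute done %.1f ms, total %.1f ms\n", cs_ms, gfx_ms);

        // Preemption case: drain graphics, run compute, restore, finish graphics.
        const double cs_done = swap_ms + cs_ms;
        const double total   = cs_done + swap_ms + gfx_ms;
        std::printf("preempt: compute done %.1f ms, total %.1f ms\n", cs_done, total);
        return 0;
    }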
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Years ago there was no timeout detection and recovery (TDR) in Windows, or if you prefer, it could be turned off. Yes, I have written infinite loops on the GPU and had to hit the power switch :razz:

    Many times.

    I have code that, given a sufficiently slow GPU and rich enough parameters, will enqueue kernels that run for much more than a second. It makes browsing or gaming quite jerky. Even when kernel run times are in milliseconds there can still be jerkiness if gaming at the same time...

    While that's something I take as read in this discussion, I think it's also fair to say it's not that hard to get a game (old GPU plus max resolution plus max settings) to run at less than 1fps and then you wonder how long before TDR...
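    For reference, the TDR window is configurable via the documented TdrDelay and TdrLevel registry values under HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers (the default delay is 2 seconds before Windows decides the GPU is hung). A minimal sketch of reading them; if the values are absent, the defaults apply:

    Code:
    // Read the (optional) TDR registry overrides; missing values mean defaults.
    #include <windows.h>
    #include <cstdio>
    #pragma comment(lib, "advapi32.lib")

    static void print_tdr_value(const wchar_t* name) {
        DWORD value = 0, size = sizeof(value);
        const LONG rc = RegGetValueW(HKEY_LOCAL_MACHINE,
                                     L"SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
                                     name, RRF_RT_REG_DWORD, nullptr, &value, &size);
        if (rc == ERROR_SUCCESS)
            wprintf(L"%ls = %lu\n", name, value);
        else
            wprintf(L"%ls not set (default applies)\n", name);
    }

    int main() {
        print_tdr_value(L"TdrDelay");  // seconds before the GPU is considered hung
        print_tdr_value(L"TdrLevel");  // 0 disables detection entirely
        return 0;
    }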

    The quick response queue isn't preempting though. It's just prioritisation. It's no different from telling your operating system to run an ordinary application at maximum priority: if that application happens to soak your CPU cores and one or more disks to near 100% utilisation, everything else you try to do on the PC becomes miserably slow.

    It's up to the graphics programmer to decide whether the quick response queue is appropriate. If graphics wavefronts are taking uncomfortably long, then that's their own fault.
     
  6. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Speculating on behavior here, but to get a benefit from concurrent execution it would make sense for CUs/SIMDs to be split among several different kernels, with the load balanced according to the perceived bottleneck. Some portion of the processing resources would likely be reserved, and CUs could be running at the lowest occupancy that still maintains full utilization. In the case of preemption, this could allow waves to be scheduled immediately. It would also decrease the likelihood of long-running compute and graphics tasks running simultaneously. If the driver kept seeing preemption requests, it could start reserving space as well. That might not work for the first frame, but it should be manageable. At the very least the pieces keep moving.

    Should be 4 priority levels for compute kernels as I recall from their ISA Whitepaper. How they're exposed to APIs is another matter.
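    On the API side, D3D12 at least lets you ask for a compute queue at a higher priority than the default, which is presumably the hook something like the quick response queue would be reached through. A minimal sketch (error handling elided); how, or whether, the driver maps these API priorities onto the hardware's four levels is exactly the open question:

    Code:
    // Create a normal-priority graphics queue and a high-priority compute queue.
    #include <d3d12.h>
    #include <wrl/client.h>
    #pragma comment(lib, "d3d12.lib")

    using Microsoft::WRL::ComPtr;

    int main() {
        ComPtr<ID3D12Device> device;
        if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0,
                                     IID_PPV_ARGS(&device))))
            return 1;

        D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
        gfxDesc.Type     = D3D12_COMMAND_LIST_TYPE_DIRECT;
        gfxDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;

        D3D12_COMMAND_QUEUE_DESC csDesc = {};
        csDesc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
        csDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;

        ComPtr<ID3D12CommandQueue> gfxQueue, csQueue;
        device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));
        device->CreateCommandQueue(&csDesc, IID_PPV_ARGS(&csQueue));
        return 0;
    }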
     
    ieldra likes this.
  7. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Worth remembering that the paper is specifically looking at Pascal through the lens of a Tesla implementation.
    I think some did manage to discuss the pre-emption/draw-call aspect with NVIDIA staff after the conference where the P100 was presented.
    While still vague, I remember one generalised quote being that from a graphics perspective the latency improvement will be around 5ms, which is the same figure Sony quoted for their own async compute benefit during PS4 development, ironically also 5ms back then.
    Cheers
    Orb
     
  8. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    I meant to put "latency" in quotes because it is not clear whether they meant 5ms per frame (which ties into the PS4 development presentation point) or latency proper; for some reason they also missed out discussing the scheduling implementation change in Pascal, which would be relevant here too. *shrug*
    Anyway, just to emphasise that the paper looks at this from the perspective of Tesla, which is an accelerator/co-processor, so it would not cover the questions being raised here, including how the new scheduler is implemented.
    Cheers
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    At least some of the promise of graphics preemption is that TDR is less likely to be tripped, since the GPU can respond more gracefully when the OS asks for it.

    At this point, the concept of preemption seems to be three or so layers deep.
    The outermost layer is OS QoS and device preemption, which can be tripped if the GPU's graphics pipeline does not respond, and that response can itself be held up by a wavefront.
    Then there's the GPU's ability to preempt for the purposes of AMD's diagram, which can be hobbled by a lack of responsiveness at the wavefront level.
    Then there's preemption at the wavefront instruction level, which might be something that can happen more often with future hardware.

    That's my general assumption of how optimistic the marketing diagram is, given how hardware is currently rather inflexible in that regard.

    The curve would be closer to the marketing ideal if the GPU could more assertively reapportion its resources. Sony's VR can, at least externally, take things to 120 FPS, and if future VR hardware and software move to use that rate natively, having smaller error bars on the GPU's response time could make the tool even more of a value-add.

    Automatically determining that sweet spot could be difficult. The inverse, where a developer purposefully creates long-lived compute tasks and possibly uses synchronization with atomics rather than queue-level barriers, could engineer a low-latency response.

    If synchronization were tighter, the preemption case could actually change the behavior of the code versus the asynchronous or prioritized cases. That might be something that could be tested more readily with a synthetic test, where the output of the compute step might change depending on whether it can reference data affected by the graphics load and on how much progress that load makes before the GPU preempts.
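    As a CPU-side analogy of the "atomics rather than queue-level barriers" idea above (only an analogy, nothing GPU-specific here): the consumer polls an atomic flag and reacts as soon as the one value it cares about is published, instead of waiting for the producer's whole batch to complete.

    Code:
    // CPU analogy only: fine-grained handoff through an atomic flag versus
    // waiting on full completion (the "queue-level barrier" equivalent).
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    std::atomic<bool> ready{false};
    int shared_value = 0;

    void producer() {
        shared_value = 42;                             // the small result we care about
        ready.store(true, std::memory_order_release);  // publish it immediately
        std::this_thread::sleep_for(std::chrono::milliseconds(100));  // rest of the "frame"
    }

    int main() {
        std::thread t(producer);

        // Low-latency path: poll the flag, react as soon as the value is published.
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        std::printf("consumed %d long before the producer finished\n", shared_value);

        t.join();  // the coarse barrier: only returns once everything is done
        return 0;
    }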
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Yes that seems to be the kind of thing that's happening.

    I wonder, though, whether D3D can only offer type 2 preemption by performing a QoS style preemption request. How long the driver/hardware takes to respond, and whether the nature of the response is something that the game programmer wants to use, are different questions.

    I think it's worth remembering that games usually employ hundreds to thousands of draw calls per frame, so the latency for worst-case preemption should be tolerable. It's just clunky.
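    The back-of-the-envelope arithmetic behind that, with assumed rather than measured numbers:

    Code:
    // Rough arithmetic only, with an assumed per-frame draw call count.
    #include <cstdio>

    int main() {
        const double frame_ms   = 1000.0 / 60.0;  // ~16.7 ms per frame at 60 fps
        const int    draw_calls = 1000;           // assumed typical count per frame
        std::printf("average draw call: ~%.0f us\n", frame_ms * 1000.0 / draw_calls);
        // ~17 us on average, so waiting for a draw-call boundary is normally cheap;
        // it's the occasional very long draw that makes preemption feel clunky.
        return 0;
    }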
     
  11. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    AFAIK that has been done before. In fact it yielded race conditions on GCN, and none on Maxwell. But you can't claim preemption from that; even pure parallel execution is sufficient to trigger such side effects. And since preemption isn't necessarily atomic, it's even harder to distinguish.

    Looks as if MS allows up to 4ms(!) for preemption, and that's already under the premise of a "reasonable workload": https://msdn.microsoft.com/en-us/library/windows/hardware/jj123501(v=vs.85).aspx / Device.Graphics.WDDM12.Render.PreemptionGranularity / D3DKMDT_GRAPHICS_PREEMPTION_GRANULARITY

    I'm clueless as to how to actually make use of these flags though. I can't even find any code samples via Google that query these caps.
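    The only route I can see from user mode is the D3DKMT "thunk" interface, roughly as sketched below. Caveat: the struct and field names here are recalled from the SDK headers (d3dkmthk.h / d3dkmdt.h) and I haven't verified them against a build, so treat this as a starting point rather than working code.

    Code:
    // Sketch: query WDDM 1.2 preemption granularity caps via the D3DKMT thunks.
    // Names are from memory of d3dkmthk.h/d3dkmdt.h and may need checking.
    #include <windows.h>
    #include <d3dkmthk.h>
    #include <cstdio>
    #pragma comment(lib, "gdi32.lib")
    #pragma comment(lib, "user32.lib")

    int main() {
        D3DKMT_OPENADAPTERFROMHDC open = {};
        open.hDc = GetDC(nullptr);                 // primary display
        if (D3DKMTOpenAdapterFromHdc(&open) != 0)  // 0 == STATUS_SUCCESS
            return 1;

        D3DKMT_WDDM_1_2_CAPS caps = {};
        D3DKMT_QUERYADAPTERINFO query = {};
        query.hAdapter              = open.hAdapter;
        query.Type                  = KMTQAITYPE_WDDM_1_2_CAPS;
        query.pPrivateDriverData    = &caps;
        query.PrivateDriverDataSize = sizeof(caps);

        if (D3DKMTQueryAdapterInfo(&query) == 0) {
            std::printf("graphics preemption granularity: %d\n",
                        (int)caps.PreemptionCaps.GraphicsPreemptionGranularity);
            std::printf("compute preemption granularity:  %d\n",
                        (int)caps.PreemptionCaps.ComputePreemptionGranularity);
        }

        D3DKMT_CLOSEADAPTER close = { open.hAdapter };
        D3DKMTCloseAdapter(&close);
        return 0;
    }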

    But that's a huge gap between the specifications and the industry requirements....

    PS:
    Not sure if it has been mentioned before, but AMD added a Quick Response Queue a couple of weeks ago for GCN GEN2 and onwards via a driver update, which is confirmed to trigger preemption to some extent.
     
  12. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    * Wavefront preemption, I meant to say. Hence also the requirement for GEN2 and above.
     
  13. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    There was mention of preemption while draw calls are being done. It wasn't in the white paper though....

    They specifically mentioned that preemption can now happen at any time, unlike before when it was limited to draw call boundaries.

    It was on nV's website too.
     
    CSI PC likes this.
  14. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,928
    Likes Received:
    1,626
    AMD Radeon R9 M480 has Polaris GPU

    http://videocardz.com/59465/amd-radeon-r9-m480-based-on-polaris-11-gpu
     
  15. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Any idea if the priority level affects scheduling on the SIMD? I always assumed it favored one kernel over another for dispatching wavefronts. No reason it couldn't pass down to SIMD scheduling though.
     