Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.
If only there were workloads from the heavens...
It's mostly a matter of sensible design, really.
I don't know if they come from Heaven or not, but jobs certainly rain from the sky. Like I said, you need to be a bit sensible about how you handle them though. Rather than looking for jobs that can be split into six or eight parts, just look for jobs that can be done six or eight at a time. There should be thousands of those.
Naughty Dog gave a nice presentation on their PS4 engine, which is set up fairly similarly to Apple's Grand Central Dispatch. In a nutshell, rather than having a main thread on one core which then spawns sub-threads on other cores, you create a job-agnostic worker thread for each core, however many cores that may be. When writing your code, you divide it up as atomically as possible, ideally packaging just about every function call as a discrete job.
Those jobs are then placed onto a master job queue and assigned to worker threads as the latter become available. Since most function calls don't really depend on anything else, it's not difficult at all to find six or twelve that could be executed simultaneously, given the thousands of lines of code in a typical app, and the fact that most systems will be running several apps at once. Some functions actually do depend on input that's provided by other functions, but those can be placed into a serial queue where they are guaranteed to be executed in order; again, though, each specific job will be executed by whichever worker thread happens to become available at the right time.
I know a few games that have used GCD; the alternative is Cilk (each thread has its own work queue plus work stealing; steal breadth-first, execute depth-first). Both are fine. The problems, of course, are getting everyone on board with using it (shouldn't be too hard, but it may be), and making the work items the right size (too small and you have too much overhead, too big and you wait too long at sync points for the last one to finish).
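The granularity trade-off mentioned above can be seen in a plain chunked parallel-for, where `chunk` is the knob. This is an illustrative sketch (the function name and signature are made up here), not a stand-in for Cilk's work-stealing deques:

```cpp
// Each worker grabs `chunk` items at a time from a shared atomic cursor.
// Tiny chunks maximize load balance but pay per-grab overhead on the
// atomic; huge chunks amortize that overhead but leave threads idle
// waiting for whoever drew the last big chunk.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

void parallel_for(std::size_t n, std::size_t chunk,
                  const std::function<void(std::size_t)>& body,
                  unsigned workers = std::thread::hardware_concurrency()) {
    if (workers == 0) workers = 1;
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < workers; ++t) {
        pool.emplace_back([&] {
            for (;;) {
                std::size_t begin = next.fetch_add(chunk);  // grab a batch
                if (begin >= n) return;
                std::size_t end = std::min(begin + chunk, n);
                for (std::size_t i = begin; i < end; ++i) body(i);
            }
        });
    }
    for (auto& t : pool) t.join();
}
```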
Today you also have to consider which processor to use and which algorithm to use (since CPU & GPU differ quite a bit, although if you target AVX & GPU you could write SPMD code for either...).
It's a bit sad that GPUs don't seem able to run asynchronously (or maybe I missed it); they are still slaves to the CPU.
D3D12 certainly is a good step forward, but IMO (expressed elsewhere also) not having gfxVirtualAlloc/gfxVirtualFree is a major pain that will needlessly cost thousands of man-hours of effort! A bad move.
We certainly have more interesting things to do than manage this pseudo-virtual-memory horror. IBM's ARC (or more likely CART) algorithm would likely work very well behind gfxVirtualAlloc/gfxVirtualFree, and would free dev time from resource management to concentrate on more interesting things, such as rendering/physics algorithms.
I don't know how Vulkan will handle it; if they dare go one step further and provide those functions, it will greatly enhance its chances of being broadly used.
A representative from Oxide confirmed that the latest drivers from Nvidia do support async compute in D3D12:
Does that make any sense when the second batch of AOTS reviews used the same driver version, and nV's performance was back to where it should have been in DX12? Something just isn't adding up here.
What does "drivers support async compute" even mean? All DX12 drivers support separate 3D and compute queues. The implementation details are the only interesting things.
At least Khronos made the wise choice of making engine queue support queryable. I still hope this will come to D3D12 too in a future update.
I also am curious to know how Pascal will implement them...
I actually disagree with this. Even if you have a "query" mechanism, how the hardware works is actually a fair bit more complicated than that in many cases. These simplistic notions of "there are separate queues and they map to separate hardware engines!" aren't even true on AMD - there's a large amount of resource sharing and scheduling that happens both in software and hardware. There simply is no meaningful way to dumb this down in the way that consumers seem to want to understand it... it's complicated and will continue to be. You can only judge performance of a given game or algorithm, and you can't meaningfully project anything like "supports async compute!!!" onto performance expectations vs. the competition or otherwise.
But I'm wasting my breath, this is becoming the new consumer freakout ("tessellation OMG") of 2016 already...
Yes, I know it is very hardware- and render-path-dependent, but it's still better than a certain "lying" driver, where software serialization might(?) be the better solution and where you have no control over it a priori.
Maybe with the upcoming dedicated GPU architectures all this "async compute" fuss will just die down (hopefully without making a 650€/£/$ GPU perform like a 350-400 one).
All this frustration is because I really, really... really hate seeing vendor and device ID checks in code.
Uttargram 2016 pls.
There's no "lying" at all, and that's my point. You are giving the implementation work that you are telling it it is *allowed* to schedule concurrently. There are no *guarantees* on any architecture, and in fact in a whole lot of cases even on AMD you won't get much real concurrency. Guaranteed concurrency would cripple performance on all architectures and be completely counter to the design and architecture of these graphics APIs.
That's kind of going against your own point. You want a caps bit (only slightly better than device IDs, really) so you can check whether you should submit something to multiple queues, whereas today you can happily submit to multiple queues on any implementation.
I know that "allowed" does not mean "guaranteed", but I still don't get the case where a driver says the hardware is allowed to schedule graphics and compute work concurrently, because the GPU should be able to take some advantage of that, but then the hardware does not produce any better performance or, even worse, performance decreases.
Compare that with other architectures that are not able to handle graphics and compute work concurrently in hardware, so the driver serializes everything and does a better job. That's all: if hardware is not able to do a particular job well, and that particular job could dramatically decrease performance, don't expose it at all (like AMD not supporting ROVs on D3D12). I am curious to see how all those GPUs will behave when UE implements "async compute" (still shorter than "graphics and compute work on different queues in concurrency"); maybe it's still too soon to say that a particular piece of hardware is "lying".
About device IDs: a caps bit is set by the driver, so it is always guaranteed the code will not choose the wrong render path (if the driver is not bugged, of course). Checking the device ID is a risk: you can always forget a particular device ID (like mobile GPU IDs), or fail to handle hardware IDs that were not yet on the market when you wrote that piece of code, and of course you need to update the code for hardware released after your application ships if you want to support it.
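The fragility being described can be sketched in a few lines. Everything here is hypothetical for illustration (`GpuInfo`, `kKnownAsyncIds`, and the IDs themselves are invented, not from any real driver API):

```cpp
// Contrast: branching on a driver-reported capability flag vs. a
// hard-coded device-ID table baked into the application.
#include <cstdint>
#include <unordered_set>

struct GpuInfo {
    std::uint32_t deviceId;
    bool asyncComputeCap;  // set by the driver; covers future hardware too
};

// Fragile: must be updated for every new device, and silently takes the
// wrong path for any ID (e.g. a mobile variant) the author forgot, or
// for hardware that shipped after the application was released.
bool useAsyncPathById(const GpuInfo& gpu) {
    static const std::unordered_set<std::uint32_t> kKnownAsyncIds = {
        0x67B0, 0x67B1, 0x6938,  // hypothetical IDs, for illustration only
    };
    return kKnownAsyncIds.count(gpu.deviceId) != 0;
}

// Robust: the driver vouches for its own hardware, released or not.
bool useAsyncPathByCap(const GpuInfo& gpu) {
    return gpu.asyncComputeCap;
}
```

A GPU released after the ID table was written falls through `useAsyncPathById` even when its driver would happily report the capability.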
Anyway, I am still curious how Pascal GPUs will handle all this, and I am fairly confident they will change a lot of things... (hopefully O_O)
Even in the absence of simultaneous issue from multiple queues, the deep pipelining of the system can allow work from multiple sources to be in-flight at the same time, making it concurrent.
Taking advantage of parallel dispatch to allow for a specific form of (possibly) concurrent execution, as marketed by AMD, is also not going to guarantee better performance. Tasks do not need to play well together, or necessarily provide a net win over the additional costs of parallel bookkeeping and resource contention/arbitration.
There are some signs that some form of concurrency was happening in earlier synthetic testing for some Nvidia chips, but how much is sufficient to deserve API recognition? And what about the various scenarios where, in the end, the lack of a particular implementation choice led to no appreciable difference over the one that had it?
This is drifting towards a bitfield that is essentially a representation of a vast range of particular implementations and scenarios, where the driver/GPU is supposed to know whether the outcome will be good in the estimation of any particular developer.
Yeah queues simply don't map as directly to what's going on as you think they do. There's concurrency even without separate queues (and always has been) and even with queues there's not necessarily concurrency. No one is going to "cripple" performance by you putting things on separate queues unless you do it to yourself with the fences.
It's similar to static samplers. If your data is static go ahead and use them. Even on hardware that doesn't do any particular optimization for them it's not going to be worse than you putting it in a heap yourself.
The whole situation is quite overblown, and I'm approaching the point that if someone says "supports async compute" - even if they are an ISV - I just dismiss it as them not knowing what they are talking about.
As a somewhat related observation: as I'm discovering, people have serious issues when it comes to thinking about things in abstract / algorithmic terms. If this had not been the case all of these epic blowouts could be avoided i.e. the programmer would have an understanding of his workload and how to algorithmically map it optimally (for some definition of that). In some cases this might mean dispatching against separate queues, in some cases not, but in general checking for bits and trying to be excessively "to-the-metal" is misguided. Maybe some async will go on today, maybe not, however hardware evolves, and getting stuck into a particular snapshot of this evolution is IMHO ill advised. Of course this particular world-view does not really go down well with the "I'm getting 110% efficiency by being totally metally / if only I had even more control and could set on-chip control bitlines myself" crowd.
The whole point of async is to increase occupancy, so if there is already a certain level of occupancy, async won't be as beneficial, depending on the architecture. So Oxide's explanations so far have been woefully simplistic and don't really say anything other than that their implementation of something is screwing up on 80% of user hardware. Pretty much this is what they can say, given all the hoops they have been jumping everyone through so far.
This... this, plus being practically the only "intensive" D3D12 application (UE4 does not actually use "async compute", though I have not had a look at the upcoming 4.11 yet), plus the lack of good documentation on green GPUs, especially on how they handle multi-engine scenarios (c'mon, the CUDA reference is not a valid replacement), plus other NDA docs I read saying other things... All this... ME = still confused about green GPUs' multi-engine support (especially graphics + compute).
Anyway, I really appreciate all your answers
I think I finally figured out why AMD's GCN architecture with the dedicated ACE units beats Maxwell so badly when using async compute, aside from the bugs Nvidia's driver suffered from initially.
There is a subtle difference in how the scheduling works.
For command processors that sport only a single hardware queue, like the graphics command processor on both GCN and Maxwell, the scheduling is done by the OS. Every single (not trivially resolvable*) synchronization point inevitably requires a flush and a round trip to the CPU, since the queue is shared and requires cooperative scheduling by the operating system.
The hardware queues provided by the ACEs for compute, however, are not shared. They are dedicated. They can simply stall on synchronization points, without requiring re-scheduling.
That comes at the cost of only providing a limited number of queues, but also greatly reduces the cost of synchronization points inside any queue allocated in hardware.
* By trivially resolvable I mean stuff like barriers, which resolve eventually either way, but also fences which are resolved automatically by a signal queued earlier in the same queue.
** Unfortunately this also works the other way around: splitting your workload onto multiple queues can result in fences no longer being trivial or free, actually requiring live scheduling.
At least for the latest GCN versions, would the hardware queues be that simplistic? Each ACE can manage more queues in memory than there are in hardware, and then there is the block of ACE hardware that was relabeled HWS with the expanded virtualization capability, which can dynamically map between queues in hardware and multiple memory contexts. The GPU in general is capable of supporting multiple virtual contexts, so the graphics command processor might not be that simple either.
Yes, they are that simple. Eight ring buffers are monitored per ACE, plus the ACE has access to the global data share for monitoring signals in case a fence is encountered in a buffer. If you wanted, you could layer software scheduling on top of each buffer, but as it looks, you usually don't need to, as 64 queues in total are sufficient.
I don't know about the HWS units; if I'm interpreting this correctly, they differ from a plain ACE mostly by having a programmable controller attached to a pair of ACEs.
I don't know what the cost for switching between one virtual context and another is.
I actually DO suspect that if virtualization features on the graphics command processor were standard for consumer GPUs, and if these features came not only with cooperative/preemptive scheduling but also with multiple monitored queues in hardware, then we could actually get the option to map (selected) 3D queues to hardware queues as well, cutting the synchronization overhead down to the same level the ACEs can achieve.