> I'm hoping that 'wide' CPUs like the AMD FX will become more useful. Right now most 'cores' are idling almost all the time.

It's mostly a matter of sensible design, really.
> If only there were workloads from the heavens...

I don't know if they come from Heaven or not, but jobs certainly rain from the sky. Like I said, you need to be a bit sensible about how you handle them, though. Rather than looking for jobs that can be split into six or eight parts, just look for jobs that can be done six or eight at a time. There should be thousands of those.
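A minimal sketch of that idea in plain C++ (the job itself and the numbers are invented for illustration): instead of splitting one task into eight cooperating pieces, launch eight independent jobs and let them fill whatever cores are free.

```cpp
#include <future>
#include <vector>

// Hypothetical per-item work; stands in for any independent job
// (culling for one light, decompressing one asset, etc.).
int do_job(int item) { return item * item; }

int main() {
    // Instead of splitting one job into N cooperating parts, launch N
    // independent jobs; each runs on whatever core is free.
    std::vector<std::future<int>> jobs;
    for (int i = 0; i < 8; ++i)
        jobs.push_back(std::async(std::launch::async, do_job, i));

    int total = 0;
    for (auto& j : jobs) total += j.get();  // collect results
    return total == 140 ? 0 : 1;            // 0+1+4+...+49 = 140
}
```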
I can confirm that the latest shipping DX12 drivers from NV do support async compute. You'd have to ask NV how specifically it is implemented.
> At least Khronos made the wise choice of making engine queue support queryable... I still hope this will come to D3D12 too in a future update.

I actually disagree with this. Even if you have a "query" mechanism, how the hardware works is actually a fair bit more complicated than that in many cases. These simplistic notions of "there are separate queues and they map to separate hardware engines!" aren't even true on AMD - there's a large amount of resource sharing and scheduling that happens both in software and hardware. There simply is no meaningful way to dumb this down in the way that consumers seem to want to understand it... it's complicated and will continue to be. You can only judge the performance of a given game or algorithm; you can't meaningfully project anything like "supports async compute!!!" onto performance expectations vs. the competition or otherwise.
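For context, this is roughly what the queryability in question looks like on the Vulkan side - a sketch that assumes a valid VkPhysicalDevice is already in hand. It only reports queue families and their capability flags, which is exactly the point above: it says nothing about how those families share hardware or how they will perform.

```cpp
#include <vulkan/vulkan.h>
#include <vector>

// Vulkan exposes queue *families* and their capability flags, but says
// nothing about how much of the underlying hardware they share.
std::vector<VkQueueFamilyProperties> query_queue_families(VkPhysicalDevice gpu) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, families.data());
    for (const auto& f : families) {
        bool compute_only = (f.queueFlags & VK_QUEUE_COMPUTE_BIT) &&
                            !(f.queueFlags & VK_QUEUE_GRAPHICS_BIT);
        (void)compute_only;  // tells you a compute-only family exists, not that
                             // submitting to it will overlap with graphics or
                             // run any faster
    }
    return families;
}
```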
Yes, I know it is very hardware- and render-path-dependent, but it is still better than certain "lying" drivers where software serialization might(?) be a better solution and where you have no control over it a priori.
But I'm wasting my breath; this is becoming the new consumer freakout ("tessellation OMG") of 2016 already...
There's no "lying" at all and that's my point. You are giving the implementation work that you are telling it it is *allowed* to schedule concurrently. That are no *guarantees* on any architectures and in fact in a whole lot of cases even on AMD you won't get much real concurrency. Guaranteed concurrency would cripple performance on all architectures and be completely counter to the design and architecture of these graphics APIs.Yes, I know it is very hardware and very render-path depending, but it still better than certain "lying" driver where a software serialization maybe(?) could be a better solution and where you do not have any control over it a priori.
> All this frustration 'cause I really, really... really hate seeing vendor and device ID checks in code.

That's kind of going against your own point. You want a caps bit (only slightly better than device IDs, really) so you can check whether you should submit something to multiple queues, whereas today you can happily submit to multiple queues on any implementation.
I know that "allowed" does not mean "guaranteed", but I still do not get the point where a driver says the hardware is allowed to schedule graphics and compute works in concurrency because the GPU should be able to take some advantage but then this hardware does not produce any better performance or - even worst - it decrease performance.There's no "lying" at all and that's my point. You are giving the implementation work that you are telling it it is *allowed* to schedule concurrently. That are no *guarantees* on any architectures and in fact in a whole lot of cases even on AMD you won't get much real concurrency. Guaranteed concurrency would cripple performance on all architectures and be completely counter to the design and architecture of these graphics APIs.
Even in the absence of simultaneous issue from multiple queues, the deep pipelining of the system can allow work from multiple sources to be in-flight at the same time, making it concurrent.
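One way this shows up at the API level, sketched below with invented PSO names: two independent dispatches recorded back-to-back on a single command list, with no barrier between them because they touch unrelated resources. Root signature and bindings are assumed to be set up elsewhere.

```cpp
#include <d3d12.h>

// Concurrency without separate queues: with no barrier between them, the
// GPU is free to overlap these two dispatches in its pipeline.
void record_independent_dispatches(ID3D12GraphicsCommandList* cl,
                                   ID3D12PipelineState* psoA,
                                   ID3D12PipelineState* psoB) {
    cl->SetPipelineState(psoA);
    cl->Dispatch(64, 1, 1);   // workload A
    // No UAV/transition barrier here, so the GPU may still be draining A...
    cl->SetPipelineState(psoB);
    cl->Dispatch(64, 1, 1);   // ...while workload B starts filling the machine.
}
```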
> That's all: if hardware is not able to do a particular job well and that particular job could dramatically decrease performance, do not support it at all (like AMD not supporting ROVs on D3D12). I am curious to see how all those GPUs will behave when UE implements "async compute" (still shorter than "graphics and compute work on different queues concurrently"); maybe it is still too soon to say that a particular piece of hardware is "lying".

This is drifting towards a bitfield that is essentially a representation of a vast range of particular implementations and scenarios, where the driver/GPU is supposed to know whether the outcome will be good in the estimation of any particular developer.
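For contrast, this is what checking an existing, well-defined capability looks like in D3D12 (ROVs, mentioned in the quote above): a yes/no question about functional support, not a prediction of whether a scheduling strategy will pay off.

```cpp
#include <d3d12.h>

// Existing caps bits answer narrow yes/no questions like this one; a
// "will async compute be a win for you" bit would be nothing like it.
bool supports_rovs(ID3D12Device* device) {
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                           &options, sizeof(options))))
        return false;
    return options.ROVsSupported != FALSE;
}
```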
> Yeah, queues simply don't map as directly to what's going on as you think they do. There's concurrency even without separate queues (and always has been), and even with queues there's not necessarily concurrency. No one is going to "cripple" performance because you put things on separate queues, unless you do it to yourself with the fences.

As a somewhat related observation: as I'm discovering, people have serious issues when it comes to thinking about things in abstract/algorithmic terms. If this were not the case, all of these epic blowouts could be avoided, i.e. the programmer would have an understanding of his workload and how to map it algorithmically in an optimal way (for some definition of that). In some cases this might mean dispatching against separate queues, in some cases not, but in general checking for bits and trying to be excessively "to-the-metal" is misguided. Maybe some async will go on today, maybe not, however hardware evolves, and getting stuck on a particular snapshot of this evolution is IMHO ill-advised. Of course this particular world-view does not really go down well with the "I'm getting 110% efficiency by being totally metal / if only I had even more control and could set on-chip control bitlines myself" crowd.
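A small sketch of the "do it to yourself with the fences" case from the quote above; queues, recorded lists and the fence are assumed to be created elsewhere, and `value` is just a monotonically increasing counter owned by the caller.

```cpp
#include <d3d12.h>

// Fencing every hand-off between the two queues means each queue sits idle
// while the other runs, discarding whatever overlap the hardware could have
// provided.
void serialize_yourself(ID3D12CommandQueue* gfxQueue,
                        ID3D12CommandQueue* computeQueue,
                        ID3D12CommandList* gfxWork,
                        ID3D12CommandList* computeWork,
                        ID3D12Fence* fence, UINT64& value) {
    gfxQueue->ExecuteCommandLists(1, &gfxWork);
    gfxQueue->Signal(fence, ++value);
    computeQueue->Wait(fence, value);      // compute cannot start until graphics is done
    computeQueue->ExecuteCommandLists(1, &computeWork);
    computeQueue->Signal(fence, ++value);
    gfxQueue->Wait(fence, value);          // and graphics cannot continue until compute is done
    // Two queues, zero overlap: any potential concurrency was fenced away.
}
```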
It's similar to static samplers. If your data is static, go ahead and use them. Even on hardware that doesn't do any particular optimization for them, it's not going to be worse than putting it in a heap yourself.
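For reference, a static sampler is simply declared as part of the root signature rather than placed in a sampler descriptor heap; the filter and addressing choices below are arbitrary.

```cpp
#include <d3d12.h>

// A static sampler baked into the root signature. The alternative is an
// equivalent sampler created in a sampler descriptor heap.
D3D12_STATIC_SAMPLER_DESC make_static_sampler() {
    D3D12_STATIC_SAMPLER_DESC s = {};
    s.Filter = D3D12_FILTER_MIN_MAG_MIP_LINEAR;
    s.AddressU = D3D12_TEXTURE_ADDRESS_MODE_WRAP;
    s.AddressV = D3D12_TEXTURE_ADDRESS_MODE_WRAP;
    s.AddressW = D3D12_TEXTURE_ADDRESS_MODE_WRAP;
    s.MaxLOD = D3D12_FLOAT32_MAX;
    s.ShaderRegister = 0;                              // s0 in HLSL
    s.ShaderVisibility = D3D12_SHADER_VISIBILITY_PIXEL;
    // Hook it up via D3D12_ROOT_SIGNATURE_DESC::pStaticSamplers and
    // NumStaticSamplers when serializing the root signature.
    return s;
}
```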
The whole situation is quite overblown, and I'm approaching the point that if someone says "supports async compute" - even if they are an ISV - I just dismiss it as them not knowing what they are talking about.
The whole point of async is to increase occupancy, so if there is already a certain level of occupancy, async won't be as beneficial, depending on the architecture. So pretty much Oxide's explanation so far has been woefully simplistic and doesn't really say anything other than that their implementation of something is screwing up on 80% of user hardware. That is pretty much all they can say, without all the hoops they have been making everyone jump through so far.
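A back-of-the-envelope illustration of the occupancy point, with entirely invented numbers: the idle fraction of the frame bounds how much async compute can be hidden "for free".

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    const double frame_ms   = 16.0;  // hypothetical frame time
    const double busy       = 0.85;  // hypothetical fraction of the frame the shader cores are already busy
    const double compute_ms = 3.0;   // hypothetical async compute workload
    // Only the idle fraction of the frame is available to absorb extra work.
    const double absorbed = std::min(compute_ms, frame_ms * (1.0 - busy));
    std::printf("hidden for free: %.1f ms of %.1f ms\n", absorbed, compute_ms);
    // At busy = 0.85 only 2.4 of the 3.0 ms can overlap; at busy = 0.5 all of
    // it could, which is why the benefit varies so much with architecture and
    // with whatever else the frame is doing.
    return 0;
}
```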
> At least for the latest GCN versions, would the hardware queues be that simplistic? Each ACE can manage more queues in memory than there are in hardware.

Yes, they are that simple. Each ACE monitors eight ring buffers, plus the ACE has access to the global data share for monitoring signals in case a fence was encountered in a buffer. If you wanted, you could use software scheduling on top of each buffer, but as it looks, you usually don't need to, as 64 queues in total are sufficient.
> The GPU in general is capable of supporting multiple virtual contexts, so the graphics command processor might not be that simple either.

I don't know what the cost for switching between one virtual context and another is.