I don't like CPUs. OK *spin*

Is that good? Bad? Could be done faster? :O :runaway:
It seems excessive but I figured I'd check out the presentation to understand what it really means.

Note that the potential performance benefit from async compute is highly architecture dependent. Wider architectures and architectures that have more trouble keeping execution units busy (due to various constraints) will see a much higher benefit than architectures with less constrained scheduling.
True.
 
I'll just say the relative priority of it compared to other features depends a lot on the architecture. It's obviously something we have looked at and will continue to look at though.

You can actually go take a look for yourself in GPA or similar analysis tools - for a given game the "EU Idle" cycles are the best you are ever going to be able to make use of, and that would be with a zero-overhead implementation and an application that always has async compute work available (not realistic, but yeah). Additionally the win is less clear on power-constrained SoCs as idle execution units can be power gated. That's not to say that filling in more work and "racing to idle" isn't still probably somewhat better, but it's less of a win than on a discrete chip where those idle execution units are just wasted cycles.
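As a rough illustration of that ceiling (numbers are mine, purely for the sake of example): if the profiler reports an EU idle fraction f_idle over a frame of length T_frame, then even a perfect implementation can recover at most

\[
\Delta T_{\max} = f_{\text{idle}} \times T_{\text{frame}} \approx 0.20 \times 16.7\,\text{ms} \approx 3.3\,\text{ms},
\]

and that already assumes zero scheduling overhead and async work always being available to fill the gaps.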

How so? This confuses me, as what I am reading here (and correct me if I am wrong) is that you have tried using Async Compute within some PC discrete set-up and for your particular problem/process it was not of a large benefit, which happens all the time in code (one size never fits all).

But to say that a SoC system that benefits greatly from low latency, shared memory is "Less of a win" than a discrete card seems factually incorrect. Power gating happens now if the device or part becomes idle (true on Console or PC hardware), but the TDP target will be to run everything at speed with fewer idle-start-idle-start patterns. If you can make this very small and push further into the TDP ceiling more often then this in itself is a win. Having the ability to use the gaps for both Visual work and/or Compute work just means that on a console the choice of how you allocate the tasks is much wider and could be far more efficient in the right place; on a PC this means graphics only, otherwise costly stalls happen which waste the cycles anyway.

This will obviously be far more widely used by 1st party developers, but having a complex GI lighting engine, for example, along with heavy physics means that on a cross-gen game the balance of how and where you do this work becomes a clear divide between PC and console, and in each direction it would favour one or the other more.
 
How so? This confuses me, as what I am reading here (and correct me if I am wrong) is that you have tried using Async Compute within some PC discrete set-up and for your particular problem/process it was not of a large benefit, which happens all the time in code (one size never fits all).

Andrew was talking from the perspective of Intel and their integrated SoCs - not discrete. And specifically I imagine he's talking about the narrow (compared to discrete and consoles) SoCs that both Intel and their main competitor in the PC market produce. And so for narrow SoCs like that, there wouldn't be as many spare ALU resources available for async compute anyway, thus not as great a win.

But to say that a SoC system that benefits greatly from low latency, shared memory is "Less of a win" than a discrete card seems factually incorrect,

It's less of a win in the sense that there is a lot more spare ALU performance on discrete cards that you could fill with async compute, leading to a larger relative improvement.

The low latency and shared memory arguments don't even need to come into it if you can already saturate your ALUs with rendering and latency-tolerant GPGPU work. The shared memory/low latency aspect only gives you more choice to include latency-sensitive GPGPU work in the mix as well - as you state below. However there seems to be a hanging question as to whether that's the most sensible use of async compute when you also have those nice fast SIMD engines on the CPU as well (amongst other reasons).

Having the ability to use the gaps for both Visual work and/or Compute work just means that on a console the choice of how you allocate the tasks is much wider and could be far more efficient in the right place

But how often does the "right place" show up? i.e. how often is it going to be far more efficient to run latency sensitive GPGPU work on the GPU at the same time as your normal rendering + your graphics async work instead of utilising the CPU's SIMD capabilities? Most (all I've seen) reports we've had from developers about their use of async compute seem to have been in relation to graphics work. Perhaps that will change as we move forward though.

This will obviously be far more widely used by 1st party developers, but having a complex GI lighting engine, for example, along with heavy physics means that on a cross-gen game the balance of how and where you do this work becomes a clear divide between PC and console, and in each direction it would favour one or the other more.

I've asked this question a couple of times already but don't think I've had a response yet (unless I've missed it), but I wonder how easy it would be to implement an async compute path for the consoles for the latency-sensitive GPGPU work and a separate AVX / AVX2 path for the PC where there's lots more CPU SIMD to spare. If that's too much work, then yes, as you say, one platform is going to suffer depending on the approach taken.
 
If you can make this very small and push further into the TDP ceiling more often then this in itself is a win.
On the chips I'm talking about (i.e. almost all SoCs these days up to ~50W or so), you are always running at max TDP when running a game. i.e. you are power limited, not hardware/area limited per se. Any power that you free up will be immediately applied to boosting frequencies. Thus it's not as if idle cycles are entirely lost as the conventional wisdom would have it, the power from them is re-purposed to run the non-idle units faster. Obviously this is a high level description and there are various levels of efficiency in this sort of process, but the general point stands that you are optimizing for *power efficiency* on these chips to get higher performance, not strictly trying to fill every idle piece of hardware with work, as the chip cannot actually sustain that at max frequencies. And the frequency range for the GPU depending on available TDP is veeeery large :)

This will obviously be far more widely used by 1st party developers, but having a complex GI lighting engine, for example, along with heavy physics means that on a cross-gen game the balance of how and where you do this work becomes a clear divide between PC and console, and in each direction it would favour one or the other more.
A lot of that is unavoidable due to the fact that the consoles are heavily biased towards decent GPUs but pretty weak CPUs, especially in terms of throughput-intensive tasks. On PCs, the "best" and lowest latency place to do "async compute" is often the CPU (even with a discrete card). The exception is if you need texture filtering or similar obviously, which would likely apply in a lot of the GI cases.

But yeah, power questions aside, I would argue that GCN needs async compute for efficiency more than other architectures, which isn't really a contentious statement (compare the monstrous theoretical throughput numbers on GCN to its performance in practice vs. other architectures). That's not a bad or good thing, it's just one design point and my point is that conclusions drawn about how to get the best performance out of it do not necessarily apply to the same extent to other architectures.
 
I've asked this question a couple of times already but don't think I've had a response yet (unless I've missed it), but I wonder how easy it would be to implement an async compute path for the consoles for the latency-sensitive GPGPU work and a separate AVX / AVX2 path for the PC where there's lots more CPU SIMD to spare.
It's not conceptually difficult with things like ISPC available these days, but the main barrier is whether you need texture filtering, or to a lesser extent access to certain GPU compressed surfaces. Neither of these would be available from the CPU. For work that doesn't need that stuff, it's pretty straightforward to do, although I would not necessarily expect folks porting from consoles to bother.
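To make that a bit more concrete, here's a minimal sketch (my own toy example, not from this thread or any real engine) of the kind of latency-sensitive job that ports cleanly to the CPU when no filtering or compressed surfaces are involved - a simple particle position update written with AVX2/FMA intrinsics. ISPC would express the same loop more portably across SIMD widths; all names here are made up for illustration.

// Toy CPU-SIMD path for a job that might otherwise go on an async compute queue.
// Assumes an AVX2 + FMA capable CPU; compile with -mavx2 -mfma (or /arch:AVX2).
#include <immintrin.h>
#include <vector>
#include <cstdio>

void integrate_positions_avx2(float* pos, const float* vel, size_t n, float dt) {
    const __m256 vdt = _mm256_set1_ps(dt);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {               // 8 floats per iteration
        __m256 p = _mm256_loadu_ps(pos + i);
        __m256 v = _mm256_loadu_ps(vel + i);
        p = _mm256_fmadd_ps(v, vdt, p);        // p += v * dt
        _mm256_storeu_ps(pos + i, p);
    }
    for (; i < n; ++i)                         // scalar tail
        pos[i] += vel[i] * dt;
}

int main() {
    std::vector<float> pos(1000, 0.0f), vel(1000, 2.0f);
    integrate_positions_avx2(pos.data(), vel.data(), pos.size(), 0.016f);
    std::printf("%f\n", pos[0]);               // expect 0.032
}

Anything that does need texture filtering or GPU-compressed surfaces has no equivalent here, which is exactly the barrier described above.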

It's also possible down the road for an API to provide a "compute queue" that maps to the CPU and accepts compute shaders that compile down to AVX2 or whatever. I guess you could kind of do this today with WARP, but I'm not sure about the efficiency there. That would likely lower the barrier enough for folks to experiment with it a bit more.
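Purely to illustrate the idea (this is not a real API, just a sketch of what a compute queue that maps to the CPU could look like), a compute-shader-style kernel could be dispatched over a grid of thread groups roughly like this:

// Hypothetical CPU "compute queue": dispatch() runs a kernel over groups x group_size
// invocations, using std::async as a stand-in for a real job scheduler.
#include <future>
#include <vector>
#include <functional>
#include <cstdio>

struct CpuComputeQueue {
    // kernel receives the global invocation index (like SV_DispatchThreadID.x)
    void dispatch(int groups, int group_size,
                  const std::function<void(int)>& kernel) {
        std::vector<std::future<void>> jobs;
        for (int g = 0; g < groups; ++g)
            jobs.push_back(std::async(std::launch::async, [=] {
                for (int t = 0; t < group_size; ++t)
                    kernel(g * group_size + t);
            }));
        for (auto& j : jobs) j.wait();          // "fence" before results are read
    }
};

int main() {
    std::vector<float> data(256, 1.0f);
    CpuComputeQueue queue;
    queue.dispatch(4, 64, [&](int id) { data[id] *= 2.0f; });
    std::printf("%f\n", data[0]);               // expect 2.0
}

A real version would presumably use a proper job system and compile the kernels down to AVX2 rather than calling through a std::function per invocation, but the programming model is the interesting part.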
 
For work that doesn't need that stuff, it's pretty straightforward to do, although I would not necessarily expect folks porting from consoles to bother.

Would they have a choice though? If a game is running latency sensitive simulation/physics code (for example) on the GPU via async compute on a shared memory console then is it even going to be possible to run that same code on a PC with the discrete GPU in the same way without it tanking the game?

It's also possible down the road for an API to provide a "compute queue" that maps to the CPU and accepts compute shaders that compile down to AVX2 or whatever. I guess you could kind of do this today with WARP, but I'm not sure about the efficiency there. That would likely lower the barrier enough for folks to experiment with it a bit more.

Yes, that's the kind of thing I was thinking about, i.e. something that leaves developers free to make as much use as they wish of the consoles' HSA-style architecture to run latency-sensitive "CPU tasks" on the GPU, while having a seamless (or at least not too difficult to implement) fallback path for the PC that leaves the same tasks on the CPU.
 
On the chips I'm talking about (i.e. almost all SoCs these days up to ~50W or so), you are always running at max TDP when running a game. i.e. you are power limited, not hardware/area limited per se. Any power that you free up will be immediately applied to boosting frequencies. Thus it's not as if idle cycles are entirely lost as the conventional wisdom would have it, the power from them is re-purposed to run the non-idle units faster. Obviously this is a high level description and there are various levels of efficiency in this sort of process, but the general point stands that you are optimizing for *power efficiency* on these chips to get higher performance, not strictly trying to fill every idle piece of hardware with work, as the chip cannot actually sustain that at max frequencies. And the frequency range for the GPU depending on available TDP is veeeery large :)


A lot of that is unavoidable due to the fact that the consoles are heavily biased towards decent GPUs but pretty weak CPUs, especially in terms of throughput-intensive tasks. On PCs, the "best" and lowest latency place to do "async compute" is often the CPU (even with a discrete card). The exception is if you need texture filtering or similar obviously, which would likely apply in a lot of the GI cases.

But yeah, power questions aside, I would argue that GCN needs async compute for efficiency more than other architectures, which isn't really a contentious statement (compare the monstrous theoretical throughput numbers on GCN to its performance in practice vs. other architectures). That's not a bad or good thing, it's just one design point and my point is that conclusions drawn about how to get the best performance out of it do not necessarily apply to the same extent to other architectures.

Thanks for the detailed reply, much appreciated :)

Agreed that the entire design of the consoles is around leaning on the GPU more than the CPU, as this is the trade-off they had to make within the die and TDP budget - best bang for buck. Very interesting to hear your info on this as it closely matches my thoughts, and it's something I am looking into at present for my own project(s), as I am interested in how APUs with HSA will affect these kinds of areas - I see them becoming the norm over a discrete set-up in the medium to long term.

Thanks again for the information - are you working on game projects, or using GPU compute for heavy workloads?
 
Andrew was talking from the perspective of Intel and their integrated SoCs - not discrete. And specifically I imagine he's talking about the narrow (compared to discrete and consoles) SoCs that both Intel and their main competitor in the PC market produce. And so for narrow SoCs like that, there wouldn't be as many spare ALU resources available for async compute anyway, thus not as great a win.



It's less of a win in the sense that there is a lot more spare ALU performance on discrete cards that you could fill with async compute, leading to a larger relative improvement.

The low latency and shared memory arguments don't even need to come into it if you can already saturate your ALUs with rendering and latency-tolerant GPGPU work. The shared memory/low latency aspect only gives you more choice to include latency-sensitive GPGPU work in the mix as well - as you state below. However there seems to be a hanging question as to whether that's the most sensible use of async compute when you also have those nice fast SIMD engines on the CPU as well (amongst other reasons).



But how often does the "right place" show up? i.e. how often is it going to be far more efficient to run latency sensitive GPGPU work on the GPU at the same time as your normal rendering + your graphics async work instead of utilising the CPU's SIMD capabilities? Most (all I've seen) reports we've had from developers about their use of async compute seem to have been in relation to graphics work. Perhaps that will change as we move forward though.



I've asked this question a couple of times already but don't think I've had a response yet (unless I've missed it), but I wonder how easy it would be to implement an async compute path for the consoles for the latency-sensitive GPGPU work and a separate AVX / AVX2 path for the PC where there's lots more CPU SIMD to spare. If that's too much work, then yes, as you say, one platform is going to suffer depending on the approach taken.
Yes, agreed that the split is going to have to be factored in for cross-platform games, and it would make far more sense (and be more common) to leave normal CPU tasks like AI, physics etc. on the CPU, leaving the graphics programmers free to use all of the ALU resources purely for GPU work, which would mirror easily on both.

But for games that want to shrink these gaps, or at least use them, to improve immersion in areas beyond simply prettier pixels, it will be an interesting time, and I hope we see some of that explored when it happens. Cheers.
 
I guess you could kind of do this today with WARP, but I'm not sure about the efficiency there. That would likely lower the barrier enough for folks to experiment with it a bit more.

You can experiment with that rather straightforwardly using C++ AMP, which exposes WARP (and e.g. textures). Doing that clearly shows that there's room for...further development, but it is nonetheless quite interesting.
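For anyone who wants to poke at that, a minimal C++ AMP snippet along those lines might look like the following (builds with MSVC, which ships AMP; the WARP accelerator is selected explicitly). Treat it as a sketch of the mechanism, not a statement about performance.

// Run a trivial "compute shader" on the WARP software device via C++ AMP.
#include <amp.h>
#include <vector>
#include <iostream>
using namespace concurrency;

int main() {
    accelerator warp(accelerator::direct3d_warp);       // WARP device path
    accelerator_view av = warp.default_view;

    std::vector<float> data(1024, 1.0f);
    array_view<float, 1> view(static_cast<int>(data.size()), data);

    parallel_for_each(av, view.extent, [=](index<1> idx) restrict(amp) {
        view[idx] *= 2.0f;                               // the "kernel"
    });
    view.synchronize();                                  // copy results back

    std::cout << data[0] << "\n";                        // expect 2
}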
 