Asynchronous Compute: what are the benefits?

Not all algorithms can be parallelized. Coming out and saying 'run everything in parallel' is not a technical argument. Simply waving off the difficulty of making things run in parallel (especially with any sort of efficiency) is exactly the sort of thing I'd expect to hear from someone with no understanding of the problem. Sometimes the overhead of running in a coherent parallel fashion outweighs simply running sequentially. There are always other costs or bottlenecks which have to be accounted for. No amount of bad-management mantras ("make it faster") can wave those factors away.

If the algorithm is "bad" it will run equally badly on any modern CPU too.
But if your algorithm runs faster on a CPU than on a GPU, it's your problem: you did not optimize it, and the CPU's execution hardware did it for you.
 
Well, that is what I do: if the algorithm is not parallel, or even if it is foreseeable that it will not scale well... I don't consider it and search for something else. I know, as a scientist I have the luxury of time, no pressure, and complete freedom... so it might not be that fair to compare this to game development...

It's totally fair to compare it.
Why does nobody use non-linear radiosity algorithms?
Why does nobody use shadow volumes?
Etc.
In interactive graphics everybody is cool with "no single-threaded or dependent algos".
It's time to accept it as a new worldview.
 
There is no such thing. GPUs _are_ strong. It's your code that's bad.
Current GPUs are optimized for very coarse workloads that have a very significant amount of math being done per memory access, and a very limited amount of divergence in the execution of each data element.
There are workloads for which the more limited SIMD width of CPUs, or SPEs, is sufficient to capture the bulk of the data parallelism that can be extracted per-cycle before divergence or just plain lack of extra work to do kicks in.

If, on a GPU, you can find extra arithmetic work, leverage specialized hardware, or the cache miss rates for the CPU are equivalently high, it's a GPU strong point.
If the cache hit rate is very high, or the granularity of the workload is very fine due to limited arithmetic density or complex control flow, it's not a GPU strong point with the architectures in question.

The CPU has an execution pipeline that can handle a far more arbitrary code flow, and in terms of maintaining coherence and memory ordering, it is far superior. The programming model for HSA has a bunch of restrictions in how programs should be structured in order to create execution points where synchronization and write-back can be initiated by specialized writes or reads in order to permit global visibility and ordering. There are various gotchas that either the code writer or compiler will need to watch out for.
CPU code, to a point, just does it.
The execution model for the GPU encourages the smaller simple program approach, and it requires extra hoops CPUs do not. It isn't required that a CPU know ahead of time how many registers the code it's running will need. It also doesn't bloat the register allocation if one of the branch paths blows up the register requirements.
CPUs just run the code.

On a CPU, there is much more cache per thread, which in the face of external memory accesses still incurring a bandwidth+latency+power penalty means there are workloads where opting for fewer threads that stay on-chip will get the job done.
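For illustration only, here is a toy C++ sketch of the divergence point (an 8-wide "wavefront" with a software execution mask; real GCN wavefronts are 64 wide, and nothing here resembles actual driver or ISA code): once the lanes disagree on a branch, both paths get issued, so utilization drops even though each lane only needed one of them.

[code]
// Toy model of SIMT divergence: one "wavefront" of LANES lanes executes a
// branch under an execution mask. If even one lane takes the other path,
// the whole wavefront pays for both paths. Purely illustrative.
#include <array>
#include <cstdio>

constexpr int LANES = 8;   // real GCN wavefronts are 64 lanes wide

int main() {
    std::array<int, LANES> data = {1, -2, 3, -4, 5, 6, -7, 8};
    std::array<int, LANES> out{};

    // Build the execution mask for the "if (x > 0)" path.
    std::array<bool, LANES> mask{};
    bool any_taken = false, any_not_taken = false;
    for (int i = 0; i < LANES; ++i) {
        mask[i] = data[i] > 0;
        any_taken     = any_taken || mask[i];
        any_not_taken = any_not_taken || !mask[i];
    }

    int issued_passes = 0;

    // Pass 1: lanes with the mask set run the "then" path; the rest sit idle.
    if (any_taken) {
        ++issued_passes;
        for (int i = 0; i < LANES; ++i)
            if (mask[i]) out[i] = data[i] * 2;
    }
    // Pass 2: lanes with the mask clear run the "else" path; the rest sit idle.
    if (any_not_taken) {
        ++issued_passes;
        for (int i = 0; i < LANES; ++i)
            if (!mask[i]) out[i] = -data[i];
    }

    // With divergent lanes both passes are issued, so peak ALU utilization is
    // halved even though each lane only needed one of them.
    std::printf("passes issued: %d (out[0]=%d, out[1]=%d)\n",
                issued_passes, out[0], out[1]);
}
[/code]

A CPU core running the same branch simply follows whichever path its single thread takes.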

If developers can ditch old thinking and write new code that fits the "thousands of threads" paradigm, they will rule tomorrow's world.
Thousands of threads that branch in thousands of different directions?
Try that with the Orbis GPU.

Unless your tomorrow's world has dispensed with physical reality, it doesn't negate the implementation details of the specific chip in question. They aren't pretending that Orbis can do what you claim, and whatever your vision is for the future, it can't transplant itself into the PS4's hardware platform.
 
Current GPUs are optimized for very coarse workloads that have a very significant amount of math being done per memory access, and a very limited amount of divergence in the execution of each data element.

That's cool, data locality and stuff.

There are workloads for which the more limited SIMD width of CPUs, or SPEs, is sufficient to capture the bulk of the data parallelism that can be extracted per-cycle before divergence or just plain lack of extra work to do kicks in.

Workloads do not descend from the heavens. The application developer creates them.
If the data locality is bad, or the code branches like crazy, you wrote bad code.
Yes, it will suck less on a CPU than on a GPU, but it will still suck.

If, on a GPU, you can find extra arithmetic work, leverage specialized hardware, or the cache miss rates for the CPU are equivalently high, it's a GPU strong point.
If the cache hit rate is very high, or the granularity of the workload is very fine due to limited arithmetic density or complex control flow, it's not a GPU strong point with the architectures in question.

If the cache hit rate is very high it will run equally well on a GPU. The only case here is if the cache hit rate is high only for specific cache sizes larger than the GPU's cache. Looks too artificial to me.
If you have low arithmetic density: do not use CPU at all. Use calculator. :)
And complex control flow = bad code. Or it's infrastructure code = no need for speed.

The CPU has an execution pipeline that can handle a far more arbitrary code flow, and in terms of maintaining coherence and memory ordering, it is far superior. The programming model for HSA has a bunch of restrictions in how programs should be structured in order to create execution points where synchronization and write-back can be initiated by specialized writes or reads in order to permit global visibility and ordering. There are various gotchas that either the code writer or compiler will need to watch out for.
CPU code, to a point, just does it.

Cost. I've heard that already. We don't want to optimize, we want the hardware to do the job, and hire fewer (or less expensive) developers.

The execution model for the GPU encourages the smaller simple program approach, and it requires extra hoops CPUs do not. It isn't required that a CPU know ahead of time how many registers the code it's running will need. It also doesn't bloat the register allocation if one of the branch paths blows up the register requirements.
CPUs just run the code.

On a CPU, there is much more cache per thread, which in the face of external memory accesses still incurring a bandwidth+latency+power penalty means there are workloads where opting for fewer threads that stay on-chip will get the job done.

CPU is good for bad/old/legacy code. I know that.

Bottom line:
CPUs and GPUs try to solve the same problem: how to keep the caches full in such a fashion that memory bandwidth is saturated all the time.
And IMHO GPUs do it much better.
 
Thousands of threads that branch in thousands of different directions?
Try that with the Orbis GPU.

Unless your tomorrow's world has dispensed with physical reality, it doesn't negate the implementation details of the specific chip in question. They aren't pretending that Orbis can do what you claim, and whatever your vision is for the future, it can't transplant itself into the PS4's hardware platform.

The PS4 GPU has been modified for fine-grained compute (see the video, starting around 41:40).
 
P.S. Any problem that needs high performance will run better on a GPU.
P.P.S. "high performance" = high arithmetic demand + high memory bandwidth requirements; ask the HPC guys.
 
Workloads do not descend from the heavens. The application developer creates them.
If the data locality is bad, or the code branches like crazy, you wrote bad code.
Yes, it will suck less on a CPU than on a GPU, but it will still suck.
A workload is the code and data set the hardware is tasked to run on.
Those are informed by the constraints of the problem the programmer is trying to solve.

If the cache hit rate is very high it will run equally well on a GPU. The only case here is if the cache hit rate is high only for specific cache sizes larger than the GPU's cache. Looks too artificial to me.
It's not artificial when cache sizes are so much larger per thread on the CPU. There are 4 MB of L2 cache for up to 8 CPU threads in Orbis.
Assuming equivalent usage, it's 512 KB of cache to play with.

There is 3/4 MB for up to 720 wavefronts, which is 46,080 threads--if you buy into the marketing.
That's about 1K per wavefront, and 17 bytes per "thread".
How many orders of magnitude are necessary before the example becomes not artificial?

edit:
My apologies, I was mentally using a larger GPU.
Orbis has 512KB, so cut the per-thread cache allocation as necessary.
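Redoing the arithmetic with the corrected figure, as a throwaway calculation (the 720-wavefront and 64-lane numbers are the marketing figures quoted above):

[code]
// Back-of-the-envelope cache-per-thread numbers from the figures above.
// Illustrative arithmetic only.
#include <cstdio>

int main() {
    const double cpu_l2_bytes = 4.0 * 1024 * 1024;  // 4 MB of CPU L2 in Orbis
    const int    cpu_threads  = 8;                  // up to 8 Jaguar threads

    const double gpu_l2_bytes = 512.0 * 1024;       // 512 KB GPU L2, per the correction
    const int    wavefronts   = 720;                // marketing figure quoted above
    const int    lanes        = 64;                 // "threads" per wavefront

    std::printf("CPU: %.0f KB of L2 per thread\n",
                cpu_l2_bytes / cpu_threads / 1024);                 // 512 KB
    std::printf("GPU: ~%.0f bytes of L2 per wavefront, ~%.0f per lane\n",
                gpu_l2_bytes / wavefronts,                          // ~728 bytes
                gpu_l2_bytes / (wavefronts * lanes));               // ~11 bytes
}
[/code]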


If you have low arithmetic density: do not use CPU at all. Use calculator. :)
I don't normally point out garbage argumentation.

And complex control flow = bad code. Or it's infrastructure code = no need for speed.
Graphics drivers/compilers.
GPU compute run-time managers.
Or are you asserting Orbis might not need those?

CPU is good for bad/old/legacy code. I know that.
They are also good at code that requires fine-grained synchronization, and there are simply problems that include that.
There are data sets that fall below the minimum the GPU needs for utilization. This is still the case for GCN.
Reduction operations are common, and it follows that if the GPU does that enough times, eventually the data it works on falls below the minimum.
See how AMD is trying to sell HSA for image recognition. The GPU is faster for the initial broad sweeps, but it falls on its face as the number of tiles drops.
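To make the reduction point concrete, a sketch of the shape of the problem (the 10,000-item floor is a made-up threshold for illustration, not a property of GCN): each pass halves the work, so the tail of any reduction eventually drops below whatever the GPU needs to stay occupied.

[code]
// Sketch of the reduction argument: each pass halves the number of work
// items, so after a few passes the remaining work can no longer fill the
// machine. 'gpu_min_items' is a made-up illustrative number.
#include <cstddef>
#include <cstdio>

int main() {
    std::size_t items = std::size_t(1) << 20;   // start with ~1M elements
    const std::size_t gpu_min_items = 10000;    // hypothetical utilization floor

    int pass = 0;
    while (items > 1) {
        const char* where = (items >= gpu_min_items) ? "GPU-friendly"
                                                     : "better finished on the CPU";
        std::printf("pass %2d: %8zu items -> %s\n", pass++, items, where);
        items = (items + 1) / 2;                // each reduction pass halves the work
    }
}
[/code]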

Bottom line:
CPUs and GPUs try to solve the same problem: how to keep the caches full in such a fashion that memory bandwidth is saturated all the time.
That is not what the CPUs try to do, they can't fully schedule around a miss that goes off chip.
Under much of their operating range, CPUs do their best to prevent off-die access.
GPUs start from a pessimistic case where they assume off-die access is extremely routine.


The PS4 GPU has been modified for fine-grained compute.
Fine-grained relative to what?
For previous-gen GPUs, sure.
 
A workload is the code and data set the hardware is tasked to run on.
Those are informed by the constraints of the problem the programmer is trying to solve.

This one somehow assumes that the workload was not developed with the hardware in mind. Or am I seeing things?

It's not artificial when cache sizes are so much larger per thread on the CPU. There are 4 MB of L2 cache for up to 8 CPU threads in Orbis.
Assuming equivalent usage, it's 512 KB of cache to play with.

There is 3/4 MB for up to 720 wavefronts, which is 46,080 threads--if you buy into the marketing.
That's about 1K per wavefront, and 17 bytes per "thread".
How many orders of magnitude are necessary before the example becomes not artificial?

I will address this below.

Graphics drivers/compilers.
GPU compute run-time managers.
Or are you asserting Orbis might not need those?

I assume these tasks do not need the performance.
They can run on a slow CPU without a problem.

They are also good at code that requires fine-grained synchronization, and there are simply problems that include that.
There are data sets that fall below the minimum the GPU needs for utilization. This is still the case for GCN.
Reduction operations are common, and it follows that if the GPU does that enough times, eventually the data it works on falls below the minimum.
See how AMD is trying to sell HSA for image recognition. The GPU is faster for the initial broad sweeps, but it falls on its face as the number of tiles drops.

What prevents you from feeding it more tiles as the data you work on is reduced in size?
Or you are speaking about the thread granularity issue?
Still I think we are side-tracking here, I'd like game development examples. I cannot think of a good reduce task here.

That is not what the CPUs try to do, they can't fully schedule around a miss that goes off chip.
Under much of their operating range, CPUs do their best to prevent off-die access.
GPUs start from a pessimistic case where they assume off-die access is extremely routine.

The working set of a modern game is several gigabytes in size = "off-die access is extremely routine".
To address the point above: cache misses are routine for games. You can see it for yourself in any modern PC game (just analyze it with a profiler or any specialized tool).
 
P.S. Any problem that needs high performance will run better on a GPU.
P.P.S. "high performance" = high arithmetic demand + high memory bandwidth requirements; ask the HPC guys.
Some HPC people think turnaround times for computations measured in days or weeks are great.


This one somehow assumes that the workload was not developed with the hardware in mind. Or am I seeing things?
This is backwards.
The hardware was developed with a workload in mind.

I assume these tasks do not need the performance.
They can run on a slow CPU without a problem.
I think the system that allows jobs to be farmed out to the GPU has a performance impact.
GPU drivers on desktop systems can themselves become a performance limiter, and that's with cores much more powerful than Jaguar.

What prevents you from feeding it more tiles as the data you work on is reduced in size?
The GPU was tasked with analyzing an image. The performance criterion is how fast the analysis can complete.

Still I think we are side-tracking here, I'd like game development examples. I cannot think of a good reduce task here.
It's a multimedia example, and for user-facing functionality the latency factor weighs heavily.
Sony also expects the GPU to perform the work for image recognition for its camera.


The working set of a modern game is several gigabytes in size = "off-die access is extremely routine".
To address the point above: cache misses are routine for games. You can see it for yourself in any modern PC game (just analyze it with a profiler or any specialized tool).
What's the rate for CPUs?
Why does their utilization rate of mere tens of GB/s of memory bandwidth rarely peak outside of benchmarks?
GPUs almost assume that every wavefront memory operation could take a full trip to memory, and they can do so without affecting arithmetic throughput if enough math operations are available.
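As a rough sketch of that last claim (all numbers are assumptions for illustration, not GCN specifics): if a miss costs on the order of a few hundred cycles, the number of wavefronts you need in flight to hide it falls as the ALU work per memory operation rises.

[code]
// Rough occupancy arithmetic for the latency-hiding claim: if a wavefront
// issues 'alu_per_mem' ALU instructions between memory operations and a miss
// costs 'mem_latency' cycles, you need roughly mem_latency / alu_per_mem
// wavefronts in flight to keep the ALUs fed. Assumed numbers, for shape only.
#include <cstdio>
#include <initializer_list>

int main() {
    const int mem_latency = 400;   // assumed cycles for a trip to DRAM
    for (int alu_per_mem : {4, 10, 25, 100}) {
        int waves_needed = (mem_latency + alu_per_mem - 1) / alu_per_mem;
        std::printf("%3d ALU ops per memory op -> ~%d wavefronts to hide the miss\n",
                    alu_per_mem, waves_needed);
    }
}
[/code]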
 
This one somehow assumes that the workload was not developed with the hardware in mind. Or am I seeing things?
Let me add my two cents.

Let's assume a programmer has a certain problem to solve. Usually there is more than one possible algorithm to get the job done. Depending on the problem, it may be very easy to roll out an efficient massively parallel algorithm, or it may not be. For quite a few problems the "natural" algorithm presents itself with only modest parallelism, complicated control flow and all sorts of things GPUs don't like very much. But they may run with very high performance on CPUs.

As I said in the beginning, there is often more than just one algorithm to solve a problem. That could be a reason to explore these. One may find an alternative which exposes more parallelism and therefore scales much better to a high number of threads, and also avoids the common performance pitfalls of GPUs. Thus, this algorithm is better suited to GPUs than to CPUs, and running the same algorithm on CPU and GPU gives the GPU a performance advantage. But that doesn't say at all that it is faster than the original algorithm running on a CPU. It could be an inherently worse algorithm in the sense that it has huge upfront computational or memory costs, that it scales much worse with the problem size, or whatever. It simply means that the problem in question may not be the best one to run on a GPU. And not everyone has the time to do years of research into finding yet another parallel implementation which may or may not be faster than the original one. Maybe, in a few years, when GPUs have further evolved, someone may find a better fit. But it doesn't help you now.

To sum it up, sometimes it isn't as easy as saying that devs just have to write better code. There may be fundamental restrictions to what one can do.
 
if your algorithm runs faster on a CPU than on a GPU, it's your problem: you did not optimize it, and the CPU's execution hardware did it for you.
CPU beats GPU in all problems that cannot be sliced into tens of thousands of independent threads. If you try to run a large number of smaller problems on the GPU, you will stall the GPU (and the CPU, because of draw/dispatch call setup overhead). It's much better just to crunch these small problems using AVX (preferably inside the CPU L1 cache).
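As a minimal sketch of that AVX-in-L1 point (hypothetical code, not from any actual engine; build with AVX enabled): a problem of a few hundred floats lives entirely in L1, and a single core chews through it with no dispatch overhead at all.

[code]
// Minimal sketch of "crunch the small problem with AVX in L1": a few hundred
// floats fit comfortably in L1 and need no GPU dispatch. Hypothetical example.
// Compile with AVX support (e.g. -mavx).
#include <immintrin.h>
#include <cstdio>

// Scale-and-accumulate over a small, cache-resident array (n a multiple of 8).
float scale_accumulate(const float* x, int n, float scale) {
    __m256 acc = _mm256_setzero_ps();
    __m256 s   = _mm256_set1_ps(scale);
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);              // 8 floats per iteration
        acc = _mm256_add_ps(acc, _mm256_mul_ps(v, s));
    }
    // Horizontal sum of the 8 partial sums.
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, acc);
    float sum = 0.f;
    for (float t : tmp) sum += t;
    return sum;
}

int main() {
    float data[256];                                    // ~1 KB: trivially L1-resident
    for (int i = 0; i < 256; ++i) data[i] = float(i);
    std::printf("%f\n", scale_accumulate(data, 256, 0.5f));
}
[/code]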

Games have lots of different types of entities that do not come in the required 10k+ numbers (needed for efficient GPU execution). Examples: currently active AI enemies (<50), ray casts needed for a single game logic frame (<100), path finding requests per frame (<50), active physics objects (game logic rigid bodies, not particle debris physics, <500), script objects running (<100), triggers running (<100), etc. Of course there are specific game types that have huge armies of enemies or are purely based on physics simulation, and those types of games might benefit more from GPU processing (assuming the branching is coherent and the dependencies are simple).

Usually algorithmic complexity rises when you move from a sequential algorithm to a parallel algorithm. An O(log(n)) sequential algorithm can, for example, become O(n) when it is parallelized. And it is quite common that O(n) sequential algorithms become O(n log n) parallel algorithms. So you pay some overhead in algorithm complexity. Sometimes it's just better to run the algorithm on the CPU, because you don't want to waste 2x-5x the GPU flops to run it on the GPU, even if it would finish faster on the GPU. The GPU is a resource, and you'd rather spend it doing something very efficient, such as graphics rendering (or simple batch processing).
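A small worked example of that complexity overhead (it only counts additions, no actual threading): an inclusive prefix sum needs n-1 additions sequentially, while the GPU-friendly Hillis-Steele formulation does O(n log n) additions spread over log2(n) fully parallel passes.

[code]
// Counting the work overhead of a parallel-friendly formulation:
// sequential inclusive scan vs. a Hillis-Steele style scan.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 10;
    std::vector<int> a(n, 1);

    // Sequential inclusive scan: n-1 additions, each dependent on the last.
    std::vector<int> seq(a);
    std::size_t seq_adds = 0;
    for (std::size_t i = 1; i < n; ++i) { seq[i] += seq[i - 1]; ++seq_adds; }

    // Hillis-Steele style scan: log2(n) passes; within a pass every element
    // is independent (GPU-friendly), but total additions grow to O(n log n).
    std::vector<int> par(a);
    std::size_t par_adds = 0;
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        std::vector<int> next(par);
        for (std::size_t i = stride; i < n; ++i) {
            next[i] = par[i] + par[i - stride];
            ++par_adds;
        }
        par.swap(next);
    }

    std::printf("sequential adds: %zu, parallel-style adds: %zu, results match: %s\n",
                seq_adds, par_adds, (seq == par) ? "yes" : "no");
}
[/code]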
 
If you try to run a large number of smaller problems on the GPU, you will stall the GPU (and the CPU, because of draw/dispatch call setup overhead). It's much better just to crunch these small problems using AVX (preferably inside the CPU L1 cache).

The question is interesting. However, supposing you have a beefy front end, what prevents you from actually running these small jobs with a relative performance penalty, but issuing many of them in the GPU schedule?
In the end, you MAY have other jobs running on your GPU (like 3D ones) that help you hide the higher latencies of those little extra jobs.
 
CPU beats GPU in all problems that cannot be sliced into tens of thousands of independent threads. If you try to run a large number of smaller problems on the GPU, you will stall the GPU (and the CPU, because of draw/dispatch call setup overhead).

My (admittedly layman) impression is that this is precisely what Sony has worked with AMD on - to prevent smaller jobs stalling the GPU, to provide a smaller overhead from the CPU side, and to be more bandwidth efficient.

Sometimes it's just better to run the algorithm on the CPU, because you don't want to waste 2x-5x the GPU flops to run it on the GPU, even if it would finish faster on the GPU. The GPU is a resource, and you'd rather spend it doing something very efficient, such as graphics rendering (or simple batch processing).

It seems a given that there are algorithms that the CPU is better at - otherwise, why even have one (or more than one or two cores of one)? But if you have GPU cycles to spare (which Sony has suggested is typically the case), and you don't have CPU cycles to spare, then it still makes sense (assuming it's not too much work to refactor the algorithm).

I fully agree that psorcerer's blanket statements are not doing his argument much service, but in general I would say there is some truth in it, in that data-focussed setups can be more efficient. Hasn't this been discussed endlessly in the light of using the SPEs to their advantage, and doesn't this hold for CUs to a large extent as well?

I don't feel we have a lot of stuff currently to compare the way CUs can be used in the next-gen consoles with for modern game engines, but we do see Sony suggesting that the way SPEs have been used to farm off jobs is made possible by the current configuration of CUs, scheduling and prioritisation, datapaths and memory access.
 
The question is interesting. However, supposing you have a beefy front end, what prevents you from actually running these small jobs with a relative performance penalty, but issuing many of them in the GPU schedule?
In the end, you MAY have other jobs running on your GPU (like 3D ones) that help you hide the higher latencies of those little extra jobs.

This is my layman understanding.


It's not just the "front end", it's resources per ALU. A GPU hides latency by doing something else, but it doesn't have anywhere near the local storage or hardware-based performance improvers (prefetch, branch prediction, instruction cache, etc.) per ALU that a CPU does. To do something else you have to have that data and instruction at the execution unit. That takes a very long time on a GPU. So if you have lots of small complex jobs waiting on complex (non-contiguous) memory accesses, you can have all your ALUs waiting for that data and you stall the GPU. A GPU wants lots of simple tasks with simple data structures.

When you look at the evolution of GPUs, you're really looking at the evolution of the data structures and tasks the GPU can effectively support.
 
Hasn't this been discussed endlessly in the light of using the SPEs to their advantage, and doesn't this hold for CUs to a large extent as well?
SPE code can effectively operate as standard multicore, running multiple independent jobs simultaneously even using the same basic algorithms as CPU code (though with data optimisations), or running the same task in parallel across cores when it's an easy fit. GPGPU requires a completely different approach, running the same task in parallel across many processors. It's not always easy to take a fundamentally linear task, or a task with data dependencies, and break it into tiny units each processed in parallel. Sometimes the algorithm that can deal with a task broken down like that is far less efficient than a linear algorithm run on a CPU core, such that it makes no sense to move everything to GPGPU.
 
But even then, what processes are there on the CPU that require so many CPU resources? Surely you will typically have a few really hungry ones that process a lot of data, which then make sense to move to the CUs? Obviously this is not something commonly done on PC, because the communication between CU and CPU isn't optimal, they don't have access to the same memory pool, there are caching issues, etc.

More importantly, I'm getting the impression that a lot of GPGPU talk today is informed by the state of technology of 2006 rather than modern architecture, interfaces and SDKs.

To take a quote from NVidia's PR material:

"GPUs have evolved to the point where many real-world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.'
Professor Jack Dongarra
Director of the Innovative Computing Laboratory
The University of Tennessee

That future is here this November? ;)
 
"GPUs have evolved to the point where many real-world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.'
Professor Jack Dongarra
Director of the Innovative Computing Laboratory
The University of Tennessee

I'm confused, how is that any different from people saying that some work fits CPUs better?

I've not seen anyone say that some tasks cannot run better on GPUs, or deny that there is the possibility of moving tasks currently done on the CPU over to GPGPU.
 
That future is here this November? ;)
It might well be. I just take considerable umbrage at psorcerer effectively saying all devs are stupid/lazy for not writing their code as efficiently parallelised GPU compute. In the areas where we are seeing GPGPU acceleration, it's come after a lot of time and research. The notion that devs should just write their games differently, easy as that, is insulting to the industry.
 
Would GPGPU really be that much faster on PS4 compared to PC?

Running TressFX on Tomb Raider causes a decent enough performance drop... Would it be less of a hit on console?
 